Proceedings of the 5th Symposium on Knowledge ... - facom.ufu [PDF]

2 Oct 2017 - Por exemplo,. 9http://opus.lingfil.uu.se/OpenSubtitles.php. Symposium on Knowledge Discovery, Mining and Le

24 downloads 51 Views 14MB Size

Recommend Stories


Proceedings of the XI Brazilian Symposium on Information Systems [PDF]
PDF · An Approach to Manage Evaluations Using an Interactive Board Game: The Ludo Educational Atlantis, Thiago Jabur Bittar, Luanna Lopes Lobato, Leandro ... Automatic Tuning of GRASP with Path-Relinking in data clustering with F-R ace and iterated F

Proceedings of the 5th Symposium on Operating Systems Design and Implementation
Never let your sense of morals prevent you from doing what is right. Isaac Asimov

Proceedings of the Ottawa Linux Symposium
Make yourself a priority once in a while. It's not selfish. It's necessary. Anonymous

5th Symposium of the TUM-Neuroimaging Center
When you talk, you are only repeating what you already know. But if you listen, you may learn something

Proceedings of the Ottawa Linux Symposium
Respond to every call that excites your spirit. Rumi

Proceedings of the Second HPI Cloud Symposium
Never let your sense of morals prevent you from doing what is right. Isaac Asimov

Proceedings of the IASS Symposium 2009, Valencia
What you seek is seeking you. Rumi

Proceedings of the 4th International Symposium on Enhanced Landfill Mining
If you want to become full, let yourself be empty. Lao Tzu

Symposium Proceedings, ISBN
The greatest of richness is the richness of the soul. Prophet Muhammad (Peace be upon him)

Proceedings The 4th International Symposium of Indonesian Wood [PDF]
Buku Ajar. Produk-Produk Panel Berbahan Dasar Kayu. Badan Penerbit Fakultas. Pertanian Universitas Pattimura, Ambon. ISBN: 978-602-03-0. Kliwon, S; Paribotro dan M. I. Iskandar. 1984. ... Regional Integration of The Wood-Based Industry: Quo Vadis? ht

Idea Transcript


October 2 to 4, 2017 ˜o Faculdade de Computa¸ ca ˆ ndia Universidade Federal de Uberla ˆ ndia, MG, Brazil Uberla

PROCEEDINGS OF THE 5TH SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING

Elaine Ribeiro de Faria Paiva, Luiz Merschmann, Ricardo Cerri (Eds.)

5th SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING October 2 to 4, 2017 Uberlˆandia – MG – Brazil

PROCEEDINGS Organization Faculdade de Computa¸ca˜o – FACOM-UFU Universidade Federal de Uberlˆandia– UFU

Local Organization Chair Elaine Ribeiro de Faria Paiva, UFU Program Committee Chairs Luiz Merschmann, UFLA Ricardo Cerri, UFSCAR Steering Committee Chair Alexandre Plastino, UFF

Support Brazilian Computer Society – SBC ISSN: 2318-1060

Dados Internacionais de Catalogação na Publicação (CIP)

S989k

Symposium on knowledge discovery, mining and learning (KDMile) (5. : 2017 : Uberlândia, MG, Brazil) Proceedings [do] 5nd Symposium on knowledge Discovery, mining and learning (KDMile), October, 2nd a 4th, 2017, Uberlândia, Minas Gerais; organizadores: Faculdade de Computação – FACOM, Universidade Federal de Uberlândia. - Uberlândia: SBC, 2017.

Inclui bibliografia. 192 p.: il. Modo de acesso: http://www.facom.ufu.br/~kdmile/ 1. Computação - Congressos. 2. Mineração de dados (Computação) - Congressos. 3. Aprendizado do computador – Congressos. I. Paiva, Elaine Ribeiro de Faria II. Merschmann, Luiz. III. Cerri, Ricardo IV. Symposium on knowledge Discovery, mining and learning (KDMile) (5.: 2017 : Uberlândia, MG, Brazil) IV. Universidade Federal de Uberlândia. V. Sociedade Brasileira de Computação. VI. Título.

CDU: 681.3(061.3) Elaborado pelo Sistema de Bibliotecas da UFU / Setor de Catalogação e Classificação

Editorial The Symposium on Knowledge Discovery, Mining and Learning (KDMiLe) aims at integrating researchers, practitioners, developers, students and users to present theirs research results, to discuss ideas, and to exchange techniques, tools, and practical experiences – related to Data Mining and Machine Learning areas. KDMiLe is organized alternatively in conjunction with the Brazilian Conference on Intelligent Systems (BRACIS) and the Brazilian Symposium on Databases (SBBD). In its fifth edition, KDMiLe is held in Uberlˆandia, Minas Gerais, from October 2nd to 4th in conjunction with SBBD. In 2017, for the first time, SBBD will be held in conjunction with the BRACIS (Brazilian Conference on Intelligent Systems). This year’s edition KDMiLe features one short course, one invited talk, one meet-up, one industrial panel, and one competition. The short course, titled “How Deep Learning Works”, is presented by Prof. Moacir Ponti from ICMC-USP – Brazil. We invited Prof. Marcos Gon¸calves (University of Minas Gerais, Brazil) to present a talk on “Sentiment Analysis in Social media: Challenges and Solutions”. The meet-up titled “Women in Data Science ” is organized by Ana Paula Appel (IBM Research). The industrial panel and the 1st Brazilian Knowledge Discovery in Databases competition (1st KDD-BR) were organized in conjunction with SBBD and BRACIS. The event received in 2017 a total of 55 manuscripts, of which 25 were selected for oral presentation after a rigorous reviewing process. This corresponds to an acceptance rate of 45%. The papers are distributed into seven technical sessions, where authors will present and discuss their work with the audience. We thank SBBD Organization Committee for hosting KDMiLe at FACOM-UFU and also our sponsors for their valuable support. We are also grateful to the Program Committee members for carefully evaluating the submitted papers. Finally, we give our special thanks to all the authors who submitted their research work to KDMiLe and contributed to a yet another high quality edition of this ever growing event in Data Mining and Machine Learning. Uberlˆandia, October 2, 2017 Elaine Ribeiro de Faria Paiva, UFU KDMiLe 2017 Local Organization Chair Luiz Merschmann, UFLA KDMiLe 2017 Program Committee Chair Ricardo Cerri, UFSCAR KDMiLe 2017 Program Committee Co-Chair

5th Symposium on Knowledge Discovery, Mining and Learning October 2-4, 2017 Uberlˆandia – MG – Brazil Organization Faculdade de Computa¸ca˜o – FACOM-UFU Universidade Federal de Uberlˆandia – UFU

Support Brazilian Computer Society – SBC

KDMiLe Steering Committee Alexandre Plastino, UFF Andr´e Ponce de Leon F. de Carvalho, ICMC-USP Wagner Meira Jr., UFMG

KDMiLe 2017 Committee Local Organization Chair Elaine Ribeiro de Faria Paiva, UFU Program Committee Chairs Luiz Merschmann, UFLA Ricardo Cerri, UFSCAR Steering Committee Chair Alexandre Plastino, UFF

KDMiLe Program Committee Alexandre Plastino (Universidade Federal Fluminense) Ana Carolina Lorena (Universidade Federal de S˜ao Paulo) Ana Paula Appel (IBM Research Brazil) Andre de Carvalho (University of S˜ao Paulo) Andr´e L. D. Rossi (Universidade Estadual Paulista J´ ulio de Mesquita Filho) Angelo Ciarlini (EMC Brazil R&D Center) Anisio Lacerda (Centro Federal de Educa¸ca˜o Tecnol´ogica de Minas Gerais) Aurora Pozo (Federal University of Paran´a) Bianca Zadrozny (IBM Research Brazil) Bruno M. Nogueira (Federal University of Mato Grosso do Sul) Carlos N. Silla Jr. (Pontifical Catholic University of Parana (PUCPR)) Cl´audia Galarda Varassin (UFES) Daniela Godoy (ISISTAN Research Institute) Edson Matsubara (UFMS) Elaine Ribeiro de Faria (Federal University of Uberlˆandia) Elaine Sousa (University of S˜ao Paulo - ICMC/USP) Fabio Cozman (Universidade de S˜ao Paulo) Fernando Otero (University of Kent) Francisco de A. T. de Carvalho (Centro de Informatica - CIn/UFPE) Frederico Durao (Federal University of Bahia) Gisele Pappa (UFMG) Helena Caseli (Federal University of S˜ao Carlos - UFSCar) Heloisa Camargo (Universidade Federal de S˜ao Carlos) Humberto Luiz Razente (Universidade Federal de Uberlˆandia - UFU) Joao Paulo Papa (UNESP - Univ Estadual Paulista) Jonathan de Andrade Silva (University of Mato Grosso do Sul) Jonice Oliveira (UFRJ) Jose Alfredo Ferreira Costa (Federal University - UFRN) Jose Viterbo (UFF) Julio Cesar Nievola (Pontif´ıcia Universidade Cat´olica do Paran´a - PUCPR) Karin Becker (UFRGS) Kate Revoredo (UNIRIO) Leandro Balby Marinho (Federal University of Campina Grande - UFCG) Leonardo Rocha (Federal University of S˜ao Jo˜ao Del Rei) Liang Zhao (University of S˜ao Paulo) Luis Z´arate (PUC-MG) Luiz Fernando Coletta (UNESP) Luiz Martins (Universidade Federal de Uberlandia) Luiz Merschmann (Federal University of Lavras) Maira Gatti de Bayser (IBM Research) Marcelino Pereira (Universidade do Estado do Rio Grande do Norte - UERN) Marcelo Albertini (Federal University of Uberlandia) Marcilio de Souto (LIFO/University of Orleans) Marcio Basgalupp (ICT-UNIFESP) Marcos Goncalves (Federal University of Minas Gerais)

Marcos Quiles (Federal University of S˜ao Paulo) Maria Camila Nardini Barioni (Universidade Federal de Uberlˆandia) Mari´a Nascimento (Universidade de S˜ao Paulo) Murillo G. Carneiro (Federal University of Uberlˆandia) Murilo Naldi (Universidade Federal de S˜ao Carlos) Renato Tin´os (USP) Ricardo Cerri (Federal University of Sao Carlos) Ricardo Prudencio (Informatics Center - UFPE) Rodrigo Barros (PUCRS) Ronaldo Prati (Universidade Federal do ABC - UFABC) Rui Camacho (LIACC/FEUP University of Porto) Solange Rezende (Universidade de S˜ao Paulo) Wagner Meira Jr. (UFMG)

External Reviewers Antonela Tommasel Pablo N. Da Silva Camila Santos Allan Sales Marcos Roberto Ribeiro Fabio Rangel Ricardo Oliveira Pablo Jaskowiak Marcos Cintra Jonnathan Carvalho Desiree M. Carvalho Christian Cesar Bones Salety Ferreira Baracho

Table of Contents Chatbot baseado em Deep Learning: um Estudo para Lingua Portuguesa . . . . . . . . . 11 Andherson Maeda (PUC-RS) and Silvia Moraes (PUC-RS)

A Review of Text-Based and Knowledge-Based Semantic Similarity Measures . . . . . 19 Angelica Ribeiro (USP), Zhao Liang (USP) and Alessandra Macedo (USP)

Parameter Learning in ProbLog with Probabilistic Rules . . . . . . . . . . . . . . . . . . . . . . . . . 27 Arthur Colombini Gusm˜ao (USP), Francisco Henrique Otte Vieira de Faria (USP), Glauber De Bona (USP), Fabio Gagliardi Cozman (USP) and Denis Deratani Mau´ a (USP)

A Dispersion-Based Discretization Method for Models Explanation . . . . . . . . . . . . . . . 35 Bernardo Stearns (UFRJ), Fabio Rangel (UFRJ), Fabr´ıcio Faria (UFRJ) and Jonice Oliveira (UFRJ)

Um M´etodo para Predi¸ca˜o de Liga¸co˜es em Redes Complexas Baseado em Hist´oricos da Topologia de Grafos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 ´ Erick Florentino (IME) and Ronaldo Ribeiro Goldschmidt (IME)

Um m´etodo para identifica¸ca˜o de pessoas em cen´arios de risco em ambientes de seguran¸ca cr´ıtica – uma an´alise experimental em ambientes offshore . . . . . . . . . . . . . . . . . . . 47 Felipe Oliveira (UFF), Flavia Bernardini (UFF) and Marcilene Viana (UFF)

Acoplamento para resolu¸ca˜o de correferˆencia em ambiente de aprendizado sem-fim 55 Felipe Quecole (UFSCar), Maisa Cristina Duarte (Universit´e Jean Monnet) and Estevam Rafael Hruschka Jr (UFSCar)

A Machine Learning Predictive System to Identify Students in Risk of Dropping Out of College . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Gabriel Silva (UnB) and Marcelo Ladeira (UnB)

Classifica¸ca˜o Hier´arquica e N˜ao Hier´arquica de Elementos Transpon´ıveis . . . . . . . . . . 69 Gean Pereira (UFSCar) and Ricardo Cerri (UFSCar)

Using graph-based centrality measures for sentiment analysis . . . . . . . . . . . . . . . . . . . . . 77

George Vilarinho (USP), Mateus Machado (USP) and Evandro Ruiz (USP)

A novel probabilistic Jaccard distance measure for classification of sparse and uncertain data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Igor Martire (UFF), Pablo Nascimento Da Silva (UFF), Alexandre Plastino (UFF), Fabio Fabris (University of Kent) and Alex Freitas (University of Kent)

Improving Activity Recognition using Temporal Regions . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Jo˜ao Paulo Aires (PUC-RS), Juarez Monteiro (PUC-RS), Roger Granada (PUC-RS), Felipe Meneguzzi (PUC-RS) and Rodrigo Barros (PUC-RS)

Label Powerset for Multi-label Data Streams Classification with Concept Drift . . . .97 Joel Costa (UFSCar), Elaine Ribeiro de Faria (UFU), Jonathan Andrade Silva (UFMS) and Ricardo Cerri (UFSCar)

Using Scene Context to Improve Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Leandro Pereira Da Silva (PUC-RS), Roger Granada (PUC-RS), Juarez Monteiro (PUC-RS) and Duncan Dubugras Alcoba Ruiz (PUC-RS)

An Empirical Comparison of Hierarchical and Ranking-Based Feature Selection Techniques in Bioinformatics Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Luan Rios Campos (UEFS) and Matheus Giovanni Pires (UEFS)

SOUTH-N: um m´etodo para a detec¸ca˜o semisupervisionada de outliers em dados de alta dimens˜ao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Lucimeire Alves Da Silva (UFU), Maria Camila Nardini Barioni (UFU) and Humberto Luiz Razente (UFU)

Learning Probabilistic Relational Models: A Simplified Framework, a Case Study, and a Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Luiz Henrique Mormille (USP) and Fabio Gagliardi Cozman (USP)

Identifica¸ca˜o de Candidatos a Fiscaliza¸ca˜o por Evas˜ao do Tributo ISS . . . . . . . . . . . 137 Marcelo Dias (UFRGS) and Karin Becker (UFRGS)

Estrat´egias de Corre¸c˜ao de Erros de Extratores de Palavras em Portuguˆes . . . . . . . 145 Matheus Nogueira (UFES) and Elias Oliveira (UFES)

A Deep Learning Approach to Prioritize Customer Service Using Social Networks 153

Paulo Amora (UFC), Elvis Teixeira (UFC), Maria Lima (UFC), Gabriel Amaral (UFC), Jos´e Cardozo (Digitro Tecnologia SA) and Javam Machado (UFC)

TATModel - Em Dire¸c˜ao a um Novo Modelo para Avalia¸ca˜o de Tradu¸co˜es Autom´aticas de Texto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Rafael Guimar˜aes Rodrigues (CEFET-RJ) and Gustavo Paiva Guedes (CEFET-RJ)

Um M´etodo de Aprendizado Multirr´otulo baseado em Aprendizado N˜ao-Supervisionado Hier´arquico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 Rodrigo Rodovalho (UFF) and Flavia Bernardini (UFF)

Extra¸c˜ao autom´atica de sementes para sistemas de aprendizado sem-fim . . . . . . . . . 173 Rom˜ao Matheus Martines de Jesus (UFSCar), Maisa Cristina Duarte (Universit´e Jean Monnet) and Estevam Rafael Hruschka Jr (UFSCar)

SVM Cascata para o Problema de Predi¸ca˜o de S´ıtio de In´ıcio de Tradu¸ca˜o . . . . . . . 177 Wallison Guimaraes (PUC-MG), Cristiano Pinto (EMGE), Cristiane Nobre (PUCMG) and Luis Z´arate (PUC-MG)

AILINE-Um M´etodo Inteligente para Detec¸c˜ao Autom´atica de Linhas Espectrais em Gal´axias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Yvson P. N. Ferreira (PGCA/UEFS), Iranderly F. de Fernandes (PGCA/UEFS) and Angelo C. Loula (PGCA/UEFS)

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Chatbot baseado em Deep Learning: um Estudo para Língua Portuguesa A. C. Maeda, S. M. W. Moraes Pontifícia Universidade Católica do Rio Grande do Sul, Brazil [email protected], [email protected] Abstract. Interest in chatbots has intensified, especially in commercial applications. Despite this, development of these conversational agents still faces many challenges, such as the automatic construction of their knowledge bases. Solutions based on recurrent neural networks have emerged as a way to minimise this problem. In this paper we present our study and we discuss the results obtained with the model sequence-to-sequence in the construction of a chatbot for Portuguese Language. Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning Keywords: Chatbots, Conversational Agents, Machine Learning

1.

INTRODUÇÃO

Agentes conversacionais, também conhecidos como sistemas de diálogo, são programas que se comunicam com os usuários em linguagem natural (por texto, por voz ou por ambas as formas)[Jurafsky and Martin 2009]. Nos últimos anos, o interesse comercial em agentes conversacionais vem aumentando. A capacidade desses agentes de manter um diálogo em linguagem natural os torna muito atrativos. Eles atuam como facilitadores em vários sentidos, tanto provendo uma forma de comunicação mais eficiente entre empresas e seus clientes (informação e retorno rápido a qualquer hora), quanto provendo uma interface mais natural para diferentes aplicativos ao substituir menus por entradas de texto ou de voz. Grandes empresas como a Apple (Siri1 ), a Microsoft (Cortana2 ), Samsung (Bixby3 ), Amazon (Alexa4 ) e a Google (Google Now5 ) têm lançado produtos que caminham nessa direção. O interesse nesses agentes têm movimentado áreas de pesquisa como aprendizagem de máquina e processamento de linguagem natural. Atualmente, os agentes conversacionais classificam-se basicamente em dois tipos: agentes de diálogo orientado a tarefas e chatbots (ou chatter bots). Os agentes orientados a tarefas possuem arquitetura cognitiva e, portanto, são baseados em meta. Eles atendem a solicitações de tarefas que o usuário expressa em linguagem natural. Esses agentes incluem os assistentes pessoais existentes em celulares, por exemplo. Já os chatbots possuem arquitetura reativa e não atendem a tarefas específicas, apenas imitam a conversação humana [Luger and Sellen 2016]. São projetados para domínios abertos. Esse é o caso do nosso trabalho. Apesar dos diversos frameworks existentes hoje para o desenvolvimento de chatbots, ainda há muito esforço manual envolvido na construção desses agentes. É importante ressaltar que esse esforço 1 http://www.apple.com/br/ios/siri/ 2 https://www.microsoft.com/pt-br/windows/cortana 3 https://techroad.com.br/noticias/2017/03/samsung-bixby-inteligencia-artificial-interface.html 4 https://developer.amazon.com/alexa 5 https://www.androidcentral.com/google-now

c Copyright 2017 Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

11

5th KDMiLe – Proceedings

2

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. C. Maeda and S. M. W. Moraes

aumenta ainda mais quando se trata da Língua Portuguesa. Pouco frameworks dão suporte ao Português e, quando o disponibilizam, são insuficientes para o processamento automático dessa língua. Em função dessas dificuldades e limitações, novas abordagens têm surgido. A Google, por exemplo, em [Vinyals and Le 2015], propõe o uso de uma rede neural sequence to sequence [Sutskever et al. 2014] para definir a base de conhecimento de um agente conversacional. Naquele trabalho, o agente "aprende a responder" a partir de um corpus de diálogos. Nosso estudo tem como base o trabalho de Vinyals et. al. [Vinyals and Le 2015; Sutskever et al. 2014] e foi aplicado para a Língua Portuguesa. A exemplo daquele trabalho, usamos corpora baseados em legendas de filmes e também em mensagens, mas, no nosso caso, de um aplicativo de celular. Nossa pesquisa mostra as dificuldades de trabalhar com a Língua Portuguesa e os resultados obtidos com a abordagem usada.

2.

TRABALHOS RELACIONADOS

Nessa seção comentamos alguns trabalhos recentes relacionados a construção da agentes conversacionais. Alguns trabalhos se assemelham com o nosso por extraírem informações a partir de corpora em português e outros por usarem abordagens baseadas em redes neurais. É importante destacar que existem muitos agentes conversacionais para Língua Portuguesa, no entanto são poucos os trabalhos que visam a construção automática da base de conhecimento desses agentes. Um desses trabalhos é o de Souza e Moraes (2015) que descrevem uma abordagem de construção automática da base de conhecimento em português através da extração de informações a partir de Frequently Asked Questions (FAQ). A vantagem do uso de uma FAQ como corpus é sua organização em perguntas e respostas. Naquele trabalho, as perguntas e respostas extraídas foram anotadas com informações morfossintáticas por meio do parser VISL6 . Com base nessas anotações, foram escolhidos como tokens mais representativos aqueles cujas tags sintáticas fossem pronomes interrogativos, sujeito, verbo principal e seus complementos. A abordagem usada mostrou a viabilidade na construção automática da base de conhecimento para chatbots em AIML a partir de texto. É importante destacar também que, até o momento, não encontramos chatbots para português baseados em deep learning. Por isso, descrevemos como trabalho relacionado o de Sutskever e Vinyals (2015), cujo estudo é para a Língua Inglesa. Esses pesquisadores apresentam um modelo de chatbot neural baseado no framework sequence to sequence [Sutskever et al. 2014]. Esse modelo é interessante pela capacidade de mapear sequências com poucos ajustes e ser independente de domínio. Se observarmos a natureza complexa, sequencial e temporal de um diálogo, o mapeamento e modelagem de uma base de diálogo para um domínio é limitada, exigindo um trabalho razoável de engenharia [Sutskever et al. 2014]. Para o treinamento foram usados dois corpora, ambos em inglês: um corpus textual constituído de legendas de filmes e o outro com diálogos do chat de suporte técnico da Google. Os dois corpora foram preprocessados da mesma forma. Houve remoção de tokens indesejados, tais como: URLs, nomes comuns, marcações XML e caracteres não relacionados a conversação. No corpus formado por legendas de filmes, os interlocutores não estão indicados. Para resolver esse problema, os autores usam a seguinte heurística : a próxima frase é a resposta da frase atual. Diferentemente do corpus de suporte técnico, a quantidade de dados no corpus de filmes é maior e ele contém dados mais ruidosos pois diversas frases consecutivas podem ter sido ditas pelo mesmo personagem numa única fala do filme. Os resultados mais significativos desse chatbot foram a capacidade de responder a perguntas que não estavam presentes no conjunto de treino e a possibilidade de utilizar corpus de domínios diferentes. Numa comparação7 feita por humanos entre o modelo neural proposto e CleverBot8 quanto a 200 perguntas, o modelo conseguiu gerar 97 respostas adequadas em relação a apenas 60 do CleverBot. 6 Anotador

morfossintático online para português - https://visl.sdu.dk/ quocle/QAresults.pdf 8 http://www.cleverbot.com/ 7 http://ai.stanford.edu/˜

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

12

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Chatbot baseado em Deep Learning: um Estudo para Língua Portuguesa

·

3

2.1 Corpora utilizados A Língua Portuguesa é considerada uma low-resource language. Existem corpora, mas nem sempre estão disponíveis e possuem a mesma qualidade e tamanho daqueles existentes para a Língua Inglesa, que é uma high-resource language [Hirschberg and Manning 2015]. Precisavamos de um corpus de diálogo preferencialmente de um mesmo domínio para usar em nosso estudo. Como não encontramos um corpus amplamente conhecido (benchmark), disponível e grande para a Língua Portuguesa, construímos esse recurso. Primeiramente, definimos um corpus com 12.775 sentenças extraídas de legendas da sequência de filmes de ficção científica Star Wars. O corpus, composto por 77.029 tokens (429KB), é formado pelas legendas dos filmes "Guerra nas Estrelas" (1977), "O Império contra Ataca" (1980), "O Retorno de Jedi" (1983), "A Ameaça Fantasma" (1999), "O Ataque dos Clones" (2002) e "A Vingança dos Sith" (2005). Ele foi extraído do repositório OpenSubtitles 9 . Um dos nossos maiores problemas com esse corpus foi a ausência de informação. As falas dos personagens não estavam identificadas, ou seja, não sabíamos qual personagem do filme tinha proferido cada sentença. Para resolver esse problema, usamos a mesma heurística utilizada em [Vinyals and Le 2015]. Consideramos cada nova sentença como um novo turno. Assumimos que cada sentença era proferida por um personagem diferente. Desta forma, a sentença atual foi utilizada como pergunta e a subsequente, como resposta. O outro corpus usado é de bate-papo e é composto por 30.684 sentenças, cerca de 309.600 tokens (1,8 MB). Corresponde a um subconjunto de um histórico de 3 anos de diálogos entre duas pessoas de uma mesma família. Esse histórico contém trocas de mensagens feitas a partir de um aplicativo para celular. Nesse corpus, cada interação (sentença) tem a informação do interlocutor, data e hora. Embora esse corpus contenha diálogos entre apenas duas pessoas, as mensagens delas não estão necessariamente intercaladas. Nem sempre há uma troca de interlocutor entre uma mensagem e outra. É comum que uma mesma pessoa poste várias mensagens antes da interação da outra pessoa. Para viabilizar o processamento da rede neural, foi necessário identificar o interlocutor juntamente com sua mensagem e agrupar aquelas que foram postadas sequencialmente pela mesma pessoa. O objetivo era formar uma única entrada contenho todas essas mensagens (sentenças) de uma mesma pessoa antes da troca de turno (de interlocutor). Esse pré-processamento resultou em um corpus com 15.659 frases, 120.940 tokens, gerando um arquivo de 632 KB. Embora links e alguns caracteres especiais não tenham sido removidos, essa redução em tamanho aconteceu porque parte do corpus original continha mensagens referentes a imagens e vídeos, as quais foram descartadas durante a etapa de pré-processamento. 3.

REDE NEURAL SEQUENCE TO SEQUENCE

O modelo Sequence to Sequence, ou seq2seq, foi proposto inicialmente por Sutsvekever et al. (2014). Na época, os autores mostraram que uma arquitetura utilizando redes Long short-term memory (LSTM) era capaz de traduzir textos em francês para inglês com desempenho comparável ao estado da arte na área de tradução automática[Sutskever et al. 2014]. A arquitetura proposta contorna problemas de outras abordagens não-recorrentes nas quais era necessário definir previamente a dimensionalidade das entradas e saídas, e também de abordagens recorrentes, que exigiam sequências de entrada com longas dependências temporais [Hochreiter and Schmidhuber 1997]. A abordagem de [Sutskever et al. 2014] é bastante similar a utilizada por [Cho et al. 2014], no entanto, ao invés de múltiplas camadas totalmente conectadas, o modelo é formado por duas partes, chamadas de encoder e decoder. O encoder recebe a sequência de entrada e mantém o contexto dessa entrada. Esse contexto é utilizado pelo decoder para gerar a sequência de saída. Diferentemente de [Cho et al. 2014], o encoder e decoder utilizam 4 camadas de LSTM [Sutskever et al. 2014]. A tradução inicia com a passagem de cada elemento temporal da sequência de entrada A B C (1) na rede. No final da sequência, um token de fim de sentença () é passado para informar ao modelo que o processo de entrada terminou e, assim, a rede pode iniciar a geração da sequência de saída (tokens W X Y Z). Essa geração de saída é feita utilizando o token predito anteriormente como entrada para o próximo token. Por exemplo, 9 http://opus.lingfil.uu.se/OpenSubtitles.php

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

13

5th KDMiLe – Proceedings

4

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. C. Maeda and S. M. W. Moraes

Fig. 1.

Exemplo da resolução de sequências de entrada e saıida no modelo seq2seq

na figura 1, o token W é utilizado como entrada para a geração do token X e, assim, sucessivamente [Sutskever et al. 2014]. 4.

ARQUITETURA DO AGENTE CONVERSACIONAL

A arquitetura do agente (Figura 2) está dividida em treinamento e conversação. No treinamento, os módulos de pré-processamento são responsáveis por tratar o corpus e gerar os tensores de treino para a rede seq2seq. O módulo rede neural é o modelo seq2seq em si. Durante a interação com o usuário, os módulos de conversação são utilizados para carregar o modelo induzido e gerar respostas para as entradas do usuário. A seguir, descrevemos cada módulo da arquitetura.

Fig. 2.

Arquitetura do agente conversacional

4.1 Pré-processamento dos Corpora Na fase de treinamento, o pré-processamento dos corpora foi dividido em duas etapas, sendo que a primeira varia de acordo com o formato do dataset em questão. Esse pré-processamento inicial já foi mencionado. No caso do corpus Star Wars refere-se à aplicação de uma heurística para definir os turnos do diálogo. Já, no caso do corpus de bate-papo, consiste no agrupamento das mensagens consecutivas de um mesmo interlocutor. Feito isso, o passo seguinte corresponde ao que chamamos no pré-processamento de pré-treino. Ele consiste em tokenizar as sentenças, convertê-las para caixa baixa (normalização) e eliminar os acentos das palavras (para evitar problemas de quebra de encoding na geração das respostas). Em seguida, usamos essas palavras para gerar os tensores da rede. Esses tensores são formados pelos índices das palavras que estão em cada sentença até encontrar um ponto final. Os índices são referentes a um dicionário que contém todas as palavras do dataset. Em nosso estudo, nenhuma técnica de redução de dimensionalidade das entradas foi incluída. Faz parte desse pré-processamento também, inverter a ordem da sentença que será utilizada como a resposta do agente (essa heurística foi sugerida por [Sutskever et al. 2014] e conforme esses autores gerou bons resultados). Além disso, acrescentamos ao final dessa sentença o índice do token de controle que indica fim de frase (). Da mesma forma, na sentença que refere-se à pergunta, adicionamos no seu inicio o índice do token de controle que indica o começo de sentença (). Esse pré-processamento é realizado Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

14

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Chatbot baseado em Deep Learning: um Estudo para Língua Portuguesa

·

5

Table I. Tabela de hiperparâmetros e valores utilizados para o treinamento da rede dataset 10000 tamanho aproximado do dataset para ser utilizado (0 = tudo) maxVocabSize 0 tamanho máximo de palavras no vocabulário (0 = sem limite) cuda true usar CUDA hiddenSize 1000 número de unidades por camada LSTM learningRate 0.001 taxa de aprendizagem no tempo t=0 gradientClipping 5 momento de clipagem de gradiente momentum 0.9 momentum minLR 0.00001 taxa mínima de aprendizagem saturateEpoch 20 época na qual a taxa de decaimento linear atingirá o minLR maxEpoch 50 número máximo de épocas durante o treino batchSize 10 tamanho do minibatch

uma única vez antes do treinamento da rede. Já para a fase de conversação, é importante destacar que nas entradas do usuário podem existir palavras que não estão nesse dicionário. Nesse caso, a palavra inexistente é substituída por um indicador que a marca como desconhecida e ela não é incluída no tensor. Sendo assim, a geração do tensor de entrada da rede acaba produzindo um vetor esparso. Nesse vetor, os valores diferentes de zero correspondem aos índices do vocabulário relativos às palavras que estão na frase; já os valores zero, indicam a ausência da palavra do vocabulário na frase. O tamanho do vetor é igual ao total de palavras únicas (distintas) do vocabulário dataset. 4.2 Rede Neural O módulo Rede Neural é a rede em si, sendo responsável por receber os tensores de treino do módulo de pré-processamento. O modelo utilizado é o apresentado por [Sutskever et al. 2014] com duas camadas distintas de LSTM, uma no encoder e outra no decoder. Hiperparâmetros como quantidade de neurônios, taxa de aprendizagem, clipagem de gradiente e tamanho do minibatch [Ruder 2016] são parametrizados no início do treinamento (ver Tabela I). Cabe mencionar ainda que o tensor predito pela rede tem o mesmo tamanho do vocabulário (de entrada) onde os índices representam as palavras e o valor, a probabilidade de cada uma delas, sendo a mais provável escolhida como palavra desejada. Os critérios de parada na geração das palavras seguem o modelo seq2seq descrito por [Sutskever et al. 2014]. Algoritmos de Deep Learning exigem poder de processamento. Sendo assim, para treinar a rede foi utilizada uma máquina virtual p2x.large10 Ubuntu 16.04.1 LTS com 4 vCPU (virtual CPU) e placa gráfica Nvidia Tesla K80 na nuvem da Amazon11 . A rede foi treinada por 50 épocas (1 minuto de processamento, em média, para cada época) . 4.3 Interface Web As perguntas são feitas pelo usuário ao agente conversacional através do módulo Interface. Esse módulo é responsável por receber as perguntas do usuário, repassar ao módulo Agente Neural, receber as respostas do módulo e mostrar para o usuário. O layout de interação com o usuário pode ser conferido na Figura 3. 5.

TESTES E ANÁLISE DE RESULTADOS

O primeiro teste que realizamos foi para o corpus Star Wars. Após o treinamento da rede com esse corpus, nós fizemos um teste de avaliação junto a usuários para medir a coerência semântica do chatbot durante a conversação. Para isso, seguimos a metodologia descrita em [Souza and Moraes 2015]. Apresentamos aos usuários um questionário de pré-teste para identificar o perfil desses usuários 10 https://aws.amazon.com/pt/blogs/aws/new-p2-instance-type-for-amazon-ec2-up-to-16-gpus/ 11 https://aws.amazon.com/pt/

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

15

5th KDMiLe – Proceedings

6

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. C. Maeda and S. M. W. Moraes

Fig. 3.

Imagem da Interface Web

e um questionário de pós-teste para que esses mesmos usuários avaliassem o desempenho do agente durante o diálogo. A avaliação foi feita por 90 usuários, que interagiram com o agente via uma interface web. Os respondentes tinham idade variando entre 17 e 62 anos, sendo que 55,56% dos usuários tinham idade entre 21 e 29 anos. A maioria dos usuários trabalhava na área de Tecnologia da Informação, cerca de 81,1%. Quase a totalidade dos usuários (95,6%) estava cursando ou já concluiu um curso superior. Destes usuários, 83,3% conheciam agentes conversacionais, 71% já interagiram com algum desses agentes e 77,8% conhecia ou já assistiu algum filme da franquia Star Wars. A interação dos usuários com o agente conversacional gerou 1.502 pares de conversa. Os usuários de 21 a 29 anos foram os que mais interagiram com o agente. Sendo que a média ficou em 16,68 mensagens por usuário. Em particular, um diálogo chamou a atenção, aquele em que o usuário forneceu 109 entradas ao agente (Figura 4). As respostas geradas tiveram um nível de coerência maior em relação aos demais diálogos. Esse usuário testou a reação do agente para entradas distantes das falas usuais do filme ainda que no tema. O agente gerou algumas respostas no contexto, apresentando uma certa coerência semântica. Para a mesma entrada, o agente gerou sempre a mesma resposta. O questionário de pós-teste foi respondido por apenas 69 usuários que relataram pouca coerência nas respostas do agente conversacional (Figura 5). Apesar disso, a maioria das respostas estavam dentro do contexto Star Wars. Os usuários também notaram a falta de continuidade do diálogo. Isso já era esperado para modelos baseados em aprendizagem de máquina. Essas abordagens apresentam limitações em diálogos longos [Chakrabarti and Luger 2015; Manning 2015]. Como os resultados não foram os esperados, levantamos algumas hipóteses que pudessem justificar a causa do mau desempenho do agente, tal como: topologia ou os hiperparâmetros da rede inadequados, tamanho do corpus ou a heurística usada para identificar os turnos dos interlocutores. Após alguns testes empiricos, mudando a topologia, os hiperparâmetros e mesmo o tamanho do corpus, percebemos que a heurística usada era a mais provável causa. Para validar essa hipótese, procuramos por um corpus que contivesse a identificação dos interlocutores. Utilizamos, então, o histórico de 3 anos de conversas entre duas pessoas de uma mesma família. Esse histórico, como já descrito, foi extraído de um aplicativo de bate-papo para celular. A topologia e configuração utilizadas foram as mesmas da Tabela I. O agente conversacional se comportou bem, gerando respostas semanticamente mais coerentes que os testes com o corpus de legenda de filmes. Durante esse teste, percebemos que o agente se comportava como um dos interlocutores dependendo da forma como as conversas aconteciam. Na Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

16

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Chatbot baseado em Deep Learning: um Estudo para Língua Portuguesa

Fig. 4.

Fig. 5.

Fig. 6.

·

7

Exemplo de diálogo sobre tema Star Wars.

Opinião dos usuários após a interação com o agente.

Trecho de diálogo do agente usando o corpus de bate-papo.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

17

5th KDMiLe – Proceedings

8

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. C. Maeda and S. M. W. Moraes

Figura 6, agente responde de forma similar ao interlocutor 1. Com esse teste ficou evidente que a heurística usada para determinar os turnos não é satisfatória e é uma das razões que levaram o agente a gerar respostas incoerentes. 6.

CONCLUSÃO E TRABALHOS FUTUROS

Os resultados que obtivemos com o corpus de bate-papo se assemelham aos relatados por [Vinyals and Le 2015] ao usar como base de treino diálogos de um chat de suporte de TI. No entanto, o mesmo não aconteceu ao utilizarmos as legendas de filmes. O nível de coerência semântica apresentado no trabalho da Google não se repetiu em nossos testes. Provavelmente, a heurística usada para definir os turnos dos interlocutores foi uma das principais responsáveis pelo baixo desempenho. Tivemos que conviver com a ausência de corpora de diálogo para língua portuguesa com as caracteristicas necessárias para esse estudo. Apesar disso, durante os testes notamos algumas características do modelo. No dataset de bate-papo para celular, percebemos traços de personalidade dependendo do interlocutor que estava interagindo com o agente. É provável que a rede esteja aprendendo o padrão de escrita (forma e vocabulário usual) dos interlocutores cujas entradas estão no dataset. Além disso, o tempo de geração de respostas para entradas mais longas foi maior do que o tempo para entradas mais curtas. Pretendemos melhorar a abordagem, usando técnicas para redução de dimensionalidade e técnicas de processamento de linguagem natural no pré-processamento dos dados, bem como testar com outros corpora. 7.

AGRADECIMENTOS

Nosso agradecimento a PUCRS (EDITAL N. 01/2016 - Programa de Apoio à Atuação de Professores Horistas em Atividades de Pesquisa) pelo apoio financeiro. REFERENCES Chakrabarti, C. and Luger, G. F. Artificial conversations for customer service chatter bots. Expert Syst. Appl. 42 (20): 6878–6897, Nov., 2015. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 , 2014. Hirschberg, J. and Manning, C. D. Advances in natural language processing. Science 349 (6245): 261–266, 2015. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation 9 (8): 1735–1780, 1997. Jurafsky, D. and Martin, J. H. Speech and Language Processing (2Nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009. Luger, E. and Sellen, A. "like having a really bad pa": The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. CHI ’16. ACM, New York, NY, USA, pp. 5286–5297, 2016. Manning, C. D. Computational linguistics and deep learning. Comput. Linguist. 41 (4): 701–707, Dec., 2015. Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 , 2016. Souza, L. S. d. and Moraes, S. M. W. Construção automática de uma base aiml para chatbot: um estudo baseado na extração de informações a partir de faqs. In Encontro Nacional de Inteligencia Artificial e Computacional BRACIS, 2015. Sutskever, I., Vynyaks, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., pp. 3104–3112, 2014. Vinyals, O. and Le, Q. A neural conversational model. arXiv preprint arXiv:1506.05869 , 2015.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

18

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A Review of Text-Based and Knowledge-Based Semantic Similarity Measures A. A. P. Ribeiro1 , L. Zhao1 , A. A. Macedo1 DCM, FFCLRP, University of São Paulo, SP, Brazil {angelribeiro,zhao,ale.alaniz}@usp.br Abstract. This article presents a comprehensive review on knowledge-based semantic similarity measures and similarity measures for text collections widely used in the literature. Such similarity measures have been applied in various types of tasks ranging from Information Retrieval to Natural Language Processing. Taking advantage of this review, we can systematize and compare them according to the type of data for structured, semi-structured, and multimedia data retrieval. For each type of data, usually there are different ways to measure the similarity. Therefore, the present review also contributes to the scientific community in such a way that it makes easier the comparison of semantic measures in terms of their evaluations, the selection of semantic measures according to a specific usage context, and the summarization of theoretical findings related to semantic measures. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.2.6 [Artificial Intelligence]: Learning Keywords: Knowledge-based Model, Similarity Measure, Semantic Processing, Textual Mining.

1.

INTRODUCTION

Data similarity measures have long been studied due to their vast applications in computer science and in other branches of science. In the last years, textual similarity measures permanently overplayed in a wide range of research areas, acting on information extraction and processing, specially, in tasks of information retrieval, textual classification, clustering of documents, topic detection, tracking, generation of questions, question-answering applications, machine translation, textual summarization, data mining, Web search, clustering and recommender systems. Classic similarity measures can be classified into two approaches: (i) similarity measures based on features and (ii) similarity measures based on links. The first group of measures calculates the similarity between objects by considering their feature vectors. On the other hand, the similarities based on links define measures in terms of the structural composition of a graph. The similarities based on features are the textual similarity algorithms that can be further classified into the following four categories: (i) similarity measure based on strings, (ii) similarity measures based on terms, (iii) similarity measures based on collections and (iv) similarity measures based on knowledge. This article is going to deal with the knowledge-based and collection-based similarity. Most of the similarity measures based on features are defined in vector space, for example, the Cosine is the most traditional measure calculates the similarity between two non-zero vectors in an inner product space [Salton 1989]; the Manhattan distance is the distance between two objects in a grid based on a rigidly horizontal and/or vertical path [Krause 2012]; the Euclidean distance measures the distance between two points in Euclidean space [Greub 1967]; the Jaccard is a statistic similarity measure used for comparing the similarity and diversity of sample sets [Jaccard 1901]; additionally,

c Copyright 2017 Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

19

5th KDMiLe – Proceedings

2

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. A. P. Ribeiro, L. Zhao, A. A. Macedo

the Dice measure is defined as twice the number of common terms in the compared strings divided by the total number of terms [Dice 1945]; the Overlap Matching coefficient counts the number of similar terms (dimensions) [Ukkonen 1990]. Overlap is similar to the Dice coefficient, but it measures how similar two strings are in terms of the number of common bigrams; Charikar Similarity is a similarity measure related to the Jaccard index that measures the overlap between two sets and it is defined as the size of the intersection divided by the smaller of the size of the two sets [Charikar 2002]. Similarity measures applied to specific domains (such as [Ehsani and Drabløs 2016]) are not considered. This article presents a review on similarity measures for text collections and knowledge-based semantic measures widely used in the information retrieval and information extraction literature. The remaining sections of this article are organized as follows: Section 2 presents the similarity measured reviewed by the authors, Section 3 details a comparative study on the similarity measures, and Section 4 concludes with final remarks and future work. 2.

SIMILARITY MEASURES

Usually, similarity measures for text document collection are presented as semantic similarity coefficients that quantify the similarity between textual information (words, sentences, paragraphs, documents, etc) based on information obtained from corpora1 . Similarity measures at concept level is a kind of semantic similarity to identify the similarity between words using information extracted from semantic networks2 . The next two subsection present, respectively, similarity measures for text collection and knowledge-based semantic measures. Usually, similarity measures for text collections are called statical similarity measures and knowledge-based similarity measures are semantic measures. 2.1

Similarity Measures for Text Document Collections

Hyperspace Analogues to Language (HAL) creates a semantic space (represented by a matrix) of co-occurrence of words after the analysis of documents in textual collections [Lund et al. 1995; Lund and Burgess 1996]. This semantic space is often a space with a large number of dimensions, in which words or concepts are represented by objects; the position of each object along the axis is somehow related to the meaning of the word [Osgood et al. 1964]. To build the semantic space, first of all, it is necessary to define the meanings of a set of axes and gather information from human subjects in order to determine where each word in question should fall on each axis. A N × N cooccurrence matrix is composed of individual words as elements, where N is the number of words in the lexical vocabulary. The lexical co-occurrence has been established by HAL as a useful basis for the construction of semantic spaces [Burgess and Lund 1994; Lund and Burgess 1996; 1996]. Latent Semantic Analysis (LSA) intends to overcome the main problems related to the use of lexical based analysis: polysemy and synonymy [Landauer and Dumais 1997]. The similarities defined by LSA are based on closeness of terms in a semantic space built, according to the co-occurrence of all terms in collections of documents manipulated instead of lexical matching. LSA exploits the Singular Value Decomposition, linear algebra factorization, in which the matrix X composed of documents and words are decomposed into the product of three component matrices. The most important dimensions (with the highest values in the singular matrix) are selected to reduce the dimension of the working space. A semantic matrix is generated by the computation of the inner-product among each column of the matrix. Given the semantic matrix, similarities are identified by considering the higher cosines. Generalized Latent Semantic Analysis (GLSA) computes the semantic relationships between terms and document as vectors are computed as linear combinations of term vectors [Matveeva et al. 1A

corpus is a large collection of textual documents which is mainly used for information extraction, information retrieval, natural language researches. 2 A semantic network is a graph to knowledge representation of semantic relations between concepts (nodes). Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

20

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A Review of Text-Based and Knowledge-Based Semantic Similarity Measures

·

3

2005]. GLSA extends LSA focusing in term vectors instead of documents as LSA. GLSA is not based on bag-of-words3 . It exploits pair-wise term similarities to compute a representation for terms. GLSA demands a measure of semantic similarity between terms and a method of dimensionality reduction. GLSA combines any similarity measure and any reduction of dimensionality. Explicit Semantic Analysis (ESA) uses a corpus of documents as a knowledge base, it represents the individual words or entire documents as vectorial representations of text documents such as HAL, LSA and GSA [Gabrilovich and Markovitch 2007]. The ESA is a measure to calculate semantic relationships between any pair of documents, any corpora including Wikipedia articles and the Open Directory Project [Egozi et al. 2011]. The documents are represented as centroids of vectors representing its words. The words are represented as a column vector in the Term-Frequency and Inverse Term-Frequency (TF-IDF) array of the text corpus. The terms or texts of ESA are portrayed by vectors with high dimension. Each element of the vector represents the pair TF-IDF between terms of documents. The semantic similarity between two terms or texts is expressed by the cosine measure between the corresponding vectors. However, unlike LSA, ESA deals with human-readable labels transforming them into concepts that make up the vector space. The conversions are possible thanks to the use of a knowledge base [Egozi et al. 2011; Gabrilovich and Markovitch 2007]. The scheme is extended from single words to multiwords documents by simply summing the arrays of all words in the documents [Gabrilovich and Markovitch 2007]. The semantic relatedness of the words is given by a numeric estimation. Cross-Language Explicit Semantic Analysis (CLESA) is a multilingual generalization of ESA [Potthast et al. 2008]. CLESA manipulates documents that are aligned with a multilingual reference collection that corresponds documents as vectors of concepts without considering the languages. The relationships between two documents in different languages are calculated by cosine considering the vector space. A document written in a specific language is represented as ESA vector by using an index document collection in the language. The similarity between a document and a document from another language is quantified in the concept space, by computing the cosine similarity between both. Pointwise Mutual Information - Information Retrieval (PMI-IR) is a method to calculate similarities between pairs of words, using the AltaVista’s Advanced Search Query [Friedman 2004], calculating the probabilities of similarity the Alta Vista calculates the similarity [Turney 2001]. This probability is based on the proximity of the pair of words in Web pages, considering greater proximity greater similarity. The PMI-IR algorithm, like LSA, is based on co-occurrence [Manning et al. 1999]. The core idea is that “a word is characterized by the neighborhood it has” [Firth 1957]. There are many different measures of the degree to which two words co-occur [Manning et al. 1999]. Therefore, the ratio between one probability to another is a measure of the degree of statistical dependence. Once the equation is symmetrical, it is the amount of information that acquires about the presence that explains the term mutual information. 
Second-order Co-occurrence - Pointwise Mutual Information (SCO-PMI) is a semantic similarity measure that applies PMI-IR to sort the list of neighboring words of two target words being compared in a collection [Islam and Inkpen 2008; 2006]. The advantage of using SCO-PMI is that it calculates the similarity between two words that they do not co-occur frequently, but the same neighboring words are co-occurring. The preprocessed word pairs are taken to calculate semantic word similarity using SCO-PMI. It is a corpus-based method for determining the semantic similarity of two target words. Evaluation result shows that the method outperforms several competing corpusbased methods. This method focuses on measuring the similarity between two target words. After finding the similarity between all words in the document, the retrieval of similar information can be performed to user query too. 3 Bag-of-words

names a model that ignores context, semantic and order of words, simplifying computational efforts. As vocabulary may potentially run into millions, this model faces scalability challenges. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

21

5th KDMiLe – Proceedings

4

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A. A. P. Ribeiro, L. Zhao, A. A. Macedo

Normalized Google Distance (NGD) is a semantic similarity measure using the number of hits returned by the Google search engine for a given set of keywords [Cilibrasi and Vitanyi 2007]. This algorithm returns the keywords with the same meanings or similar meanings based on natural language processing. The words are similar if that tend to be “near" in distance units of Google, while words with different meanings tend to be more distant. Similarity through the Co-occurrence Distribution (DISCO) assumes that words with similar meanings occur in a similar context [Kolb 2009]. Large text collections are statistically analyzed to obtain similarity of distribution. The DISCO is a method that calculates the similarity of distribution between the words using a simple context window of a size of approximately three words for the count of co-occurrences. When two words are submitted to the calculation of the exact similarity, DISCO simply retrieves its vectors from the indexed data and calculates the similarity according to the Lin measure [Lin 1998], presented next. If the most similar word according to the distribution is requested, DISCO returns the second order of the word vector. DISCO has two main similarity measures: DISCO1 and DISCO2. The DISCO1 calculates the first order similarity between two input words according to the word arrangement sets. The DISCO2 calculates the second order similarity between two input words, according to the distribution of similar words. 2.2 Knowledge-based Measures Knowledge-based measures try to identify the degree of similarity among concepts by using algorithms supported by lexical resources and/or semantic networks. The similarity measures based on knowledge can be separated into two groups: the semantic relatedness measures and the semantic-based measures. The semantic relatedness measures indicate the strength of the semantic interactions between objects since there are no constraints on the quality of the considered semantic links. Semantic relatedness similarities measures is a category of relationships between two words, incorporating a bigger range of relationships between concepts such as “is_type_of”, “is_one_exemple_specific_of”, “is_part_of”, “is_the_opposite_of” [Patwardhan et al. 2003]. The most used examples of semantic relatedness measures are Resnik (RES) [Resnik 1995; KG and SADASIVAM 2017], Lin (LIN) [Lin 1998], Jiang & Conrath (JCN) [Jiang and Conrath 1997], St.Onge (HSO) [Hirst et al. 1998], Lesk (LESK) [Banerjee and Pedersen 2002], and Pairs of Vectors (Vectors) [Patwardhan 2003]. RES is a measure of semantic similarity based on the notion of information content by considering an “is a” taxonomy. The value of RES is the information content of the Least Common Subsumer 4 [Resnik 1995; KG and SADASIVAM 2017]. LIN suggests the semantic similarity between two topics in a taxonomy [Lin 1998]. LIN is defined as a function of the meaning shared by the topics and the meaning of each of the individual topics. The meaning shared by two topics can be recognized by looking at the lowest common ancestor, which corresponds to the most specific common classification of the two topics. Once this common classification is identified, the meaning shared by two topics can be measured by the amount of information needed to state the commonality of the two topics. The semantic similarity is defined in terms of the hierarchical taxonomy. The disadvantage is to capture the semantic relationships in non-hierarchical components. 
JCN is a hybrid similarity measure that mixes words or concepts: it combines a taxonomic structure with corpus-based statistics. Exploiting the best of both, the taxonomy helps to guarantee the semantics, while the statistical approach provides evidence from the distribution observed in the exploited corpus. LIN and JCN extend the information content of the Least Common Subsumer by also considering the sum of the information contents of the compared concepts: LIN scales (twice) the information content of the Least Common Subsumer by that sum, while JCN takes the difference between the sum and twice the information content of the Least Common Subsumer.
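For reference, these information-content measures are usually written as follows, where IC(c) = -log p(c) is the information content of concept c and lcs(c1, c2) denotes the Least Common Subsumer (this is the standard textbook formulation, added here only for clarity):

    sim_RES(c1, c2)  = IC(lcs(c1, c2))
    sim_LIN(c1, c2)  = 2 * IC(lcs(c1, c2)) / (IC(c1) + IC(c2))
    dist_JCN(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2))

Note that JCN is a distance, so smaller values indicate more similar concepts; implementations often report its inverse as a similarity score.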



The measure HSO finds lexical chains relating two word senses [Hirst et al. 1998]. It computes relatedness between concepts using the length of the path between the concept nodes, the number of changes of direction in the path connecting the two concepts, and the allowableness of the path. When the relation between concepts is close, they are semantically related to each other [Choudhari 2012]. LESK discovers overlaps between the glosses of WordNet synsets (WordNet is presented next); the relatedness score is the sum of the squares of the overlap lengths. The Vector measure, in turn, builds a co-occurrence vector for each word appearing in the WordNet glosses and then represents each gloss/concept as the average of the co-occurrence vectors of its words [Patwardhan 2003]. Vector was developed as a measure of semantic relatedness that represents concepts using context vectors and establishes relatedness by measuring the angle between these vectors. This measure combines the information from a dictionary with statistical information derived from large corpora of text: semantic relatedness is measured simply as the nearness of the two vectors in the multidimensional space (the cosine of the two normalized vectors). One of the strengths of this measure is that its basic idea can be used with any dictionary, not only WordNet.

The semantic-based measures comprise four categories: (i) semantic similarity measures exploit taxonomic relationships between terms to extract similarity; (ii) semantic distance is the inverse of semantic relatedness; (iii) semantic dissimilarity is the inverse of semantic similarity; and (iv) taxonomic distance is related to dissimilarity. In the literature, the most cited examples of semantic-based measures are Leacock & Chodorow (LCH) [Leacock and Chodorow 1998], Wu & Palmer (WUP) [Wu and Palmer 1994], and Path Length (Path) [Wu and Palmer 1994].

LCH measures the length of the shortest path between two concepts using node counting and the maximum depth of the taxonomy [Leacock and Chodorow 1998]. It returns a score indicating the similarity between two word senses, considering the shortest path connecting the senses and the maximum depth of the taxonomy in which the senses occur. WUP measures the similarity between two word senses considering the depth of the two senses in the taxonomy and the depth of their Least Common Subsumer [Wu and Palmer 1994]. WUP originated in UNICON, a prototype lexical selection system that represents English and Chinese verbs based on a set of shared semantic domains, with the selection information included in these representations without exact matching; the concepts are organized into hierarchical structures to form an interlingual conceptual base, and the input to the system is the source verb argument structure. The Path measure quantifies the similarity between two word senses based on the shortest path linking the two senses in the "is_a" (hypernym/hyponym) taxonomy [Wu and Palmer 1994]. This measure is inversely proportional to the number of nodes along the shortest path between the concepts. The shortest possible path occurs when the two concepts are the same, in which case the length is 1; thus, the maximum similarity value is 1. Overall, the presented approaches are distinct algorithms used to calculate semantic similarity between concepts.
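As a minimal example of how such taxonomy-based scores can be obtained in practice, the following Python sketch uses NLTK's WordNet interface (assuming NLTK and its WordNet corpus are installed; it is an illustration, not the setup used in this review):

from nltk.corpus import wordnet as wn  # requires: import nltk; nltk.download('wordnet')

car = wn.synset('car.n.01')
bus = wn.synset('bus.n.01')

# Shortest-path, Wu & Palmer and Leacock & Chodorow similarities over the noun taxonomy.
print(car.path_similarity(bus))   # Path: inverse of the shortest path length
print(car.wup_similarity(bus))    # WUP: depths of the senses and of their Least Common Subsumer
print(car.lch_similarity(bus))    # LCH: shortest path scaled by the taxonomy depth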
The semantic similarity can be improved by using human knowledge to produce a more accurate measure. Human knowledge is usually expressed in dictionaries, taxonomies, ontologies and concept networks. WordNet is the most popular concept network used to measure similarity based on knowledge [Miller et al. 1990]. This concept network is a large graph, or lexical database, where each node represents a real-world concept expressed by English words classified as nouns, verbs, adjectives and adverbs. Words are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept: a concrete object such as a house, an entity such as a teacher, an abstract concept such as art, and so on. Every node of the network thus consists of a set of synonymous words (a synset) that represents the real-world concept associated with that node, and each synset has an associated gloss (a short definition or description of the real-world concept). The synsets


and the glosses are similar to the content of an ordinary dictionary, corresponding to synonyms and definitions, respectively. Synsets are interlinked by means of conceptual-semantic and lexical relations. Each link or edge describes a relationship between the real-world concepts represented by the linked synsets; types of relationships include "opposite of", "is a member of", "causes", "pertains to", "is a kind of", "is a part of" and others. The network of relations between word senses present in WordNet encodes a vast amount of human knowledge, giving rise to a great number of possibilities of knowledge representation for various tasks. WordNet has been manipulated by different approaches in order to automatically extract its association relations and to interpret these associations in terms of a set of conceptual relations, such as the DOLCE foundational ontology [Gangemi et al. 2003]. In terms of limitations, WordNet does not present the etymology or the pronunciation of words, and it is basically composed of everyday English words.

Agirre & Rigau developed the notion of conceptual density to create their algorithm for Word Sense Disambiguation [Agirre and Rigau 1997]. They used the context of a given word along with the hierarchy of "is-a" relations in WordNet to find the exact sense of the word. The algorithm divides the network hierarchy of WordNet into sub-hierarchies, so that each sense of the ambiguous word belongs to one sub-hierarchy. The conceptual density of each sub-hierarchy is then calculated using a formula that, intuitively, describes the amount of space occupied by the context words in each of the sub-hierarchies.
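A small, illustrative NLTK session (again assuming the WordNet corpus is available locally) shows the synsets, glosses and relations described above:

from nltk.corpus import wordnet as wn

for synset in wn.synsets('house')[:3]:
    print(synset.name())          # e.g. 'house.n.01'
    print(synset.definition())    # the gloss: a short description of the concept
    print(synset.lemma_names())   # the synonymous words grouped in this synset
    print(synset.hypernyms())     # "is a kind of" links to more general synsets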

3. A COMPARATIVE STUDY OF THE SIMILARITY MEASURES

There is no standard way to evaluate similarity measures without resorting to human judgments. Hence, computational similarity measures must rate the similarity of a set of word pairs, and one must then examine how well their ratings correlate with human ratings of the same pairs. To evaluate measures, it is also useful to compare them with respect to important features required by most applications. Such an analysis can help the scientific community to select measures according to a usage scenario and to summarize findings related to the measures.

Here we present a comparative study of similarity measures. Table I lists similarity measures against the main requirements of similarity algorithms. The similarity measures are the nineteen measures presented above and the classical measures Cosine, Jaccard, Dice Measure, Euclidean distance, Overlap Coefficient and Manhattan distance. The main requirements are Changeable Granularity (the granularity of the information to be manipulated: word, paragraph, document, etc.), Partial Matching, Ranking Relevance Allowed, Terms Weights (whether there is a term-weighting scheme), Easy Implementation, Size Document Dependency, Dependency of Ordered Terms, Semantic Sensibility (what kind of knowledge-based resource is used), and the approach implemented by the measure.

A comparative analysis shows that the classical vector space models (lines 1 to 6 of the table) operate only at a very raw text level, with no information added by semantics or ontologies. The statistical and semantic similarity measures (lines 7 to 19), on the other hand, aggregate semantics in the process. The classical and the semantic measures allow changing the granularity of the manipulated information, whereas the statistical measures are not flexible in this regard. The classical measures depend on the size and the terms of the documents and do not aggregate semantics. The statistical measures are less dependent on these factors, but they usually have a higher computational cost, and their semantics is focused on the collection. Finally, the semantic (knowledge-based) measures fulfill most of the specified requirements, but they can be laborious to implement and, in many cases, are domain-specific.
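To make the contrast with the classical measures concrete, the following sketch computes two of them, Jaccard over token sets and cosine over term-count vectors, for a pair of short sentences (illustrative code, not part of the comparative study):

from collections import Counter
from math import sqrt

def jaccard(a_tokens, b_tokens):
    # Ratio of shared distinct tokens to all distinct tokens.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

def cosine(a_tokens, b_tokens):
    # Cosine of the angle between the two term-frequency vectors.
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

s1 = "the cat sat on the mat".split()
s2 = "the cat lay on a mat".split()
print(jaccard(s1, s2), cosine(s1, s2))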


Table I. Comparisons of desirable requirements of similarity measures. The requirement columns are: 1: Changeable Granularity; 2: Partial Matching; 3: Ranking Relevance Allowed; 4: Terms Weights; 5: Easy Implementation; 6: Size Document Dependency; 7: Dependency of Terms; 8: Semantic Sensibility.

Measure: Approach
Jaccard [Jaccard 1901]: Distance between vectors
Dice Measure [Dice 1945]: Distance between vectors
Euclidean Distance [Greub 1967]: Distance between vectors
Cosine [Salton 1989]: Distance between vectors
Overlap Coefficient [Ukkonen 1990]: Distance between vectors
Manhattan [Krause 2012]: Distance between vectors
HAL [Lund et al. 1995; Lund and Burgess 1996]: Semantic space: word co-occurrence, similarity vector
LSA [Landauer and Dumais 1997]: Singular value decomposition, cosine
GeLSA [Matveeva et al. 2005]: Term vectors, similarity measure and dimensionality reduction
ESA [Gabrilovich and Markovitch 2007]: TF-IDF, cosine, machine learning, and Wikipedia data
CLESA [Potthast et al. 2008]: Cosine of document vectors
PMI-IR [Turney 2001]: Word co-occurrence, machine learning
SOCPMI [Islam and Inkpen 2008; 2006]: Co-occurrence of neighboring words
NGD [Cilibrasi and Vitanyi 2007]: Google distance, natural language
DISCO [Kolb 2009]: Word co-occurrence window distribution
WordNet [Miller et al. 1990]: Psycholinguistic theories and human lexical memory
WUP [Wu and Palmer 1994]: Least Common Subsumer and shortest path
Path Length [Wu and Palmer 1994]: Shortest path
RES [Resnik 1995]: Least Common Subsumer
JCN [Jiang and Conrath 1997]: Semantic similarity and Least Common Subsumer
LIN [Lin 1998]: Least Common Subsumer
LCH [Leacock and Chodorow 1998]: Shortest path and taxonomic score
HSO [Hirst et al. 1998]: Semantic affinity and lexical chain pairs
LESK [Banerjee and Pedersen 2002]: Semantic affinity and Lesk's dictionary with WordNet
Vector [Patwardhan 2003]: Semantic affinity and co-occurrence matrix with WordNet

4. FINAL REMARKS AND RESEARCH DIRECTIONS

This review article has covered the commonly used collection/knowledge-based similarity measures and corpus-based similarity measures. Similarity measurement remains an important subject of study, and new measures are continuously proposed. Nevertheless, future tendencies point to the representation of text and its relations as networks; WordNet and ontologies, for example, have shown the effectiveness of representing textual data as a network. This tendency is supported by a whole new branch of network science that works on large-scale graphs with non-trivial topological structures: complex networks. We are currently working in this direction. As a result, the presented similarity measures are being augmented with network measures, and they will be quantitatively and qualitatively compared; formulas and more detailed descriptions will be included, and an application will implement all the measures.

5. ACKNOWLEDGEMENTS

The authors would like to thank CAPES (1569180), FAPESP (2016/13206-4) and CNPq (302031/20162) for their financial support.

REFERENCES

Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. Amsterdam Studies in the Theory and History of Linguistic Science Series 4, 1997.
Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 136-145, 2002.
Burgess, C. and Lund, K. Multiple constraints in syntactic ambiguity resolution: A connectionist account of psycholinguistic data. COGSCI-94, Atlanta, GA, 1994.
Charikar, M. S. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing. ACM, pp. 380-388, 2002.
Choudhari, M. Extending the Hirst and St-Onge measure of semantic relatedness for the unified medical language system. Ph.D. thesis, University of Minnesota, 2012.
Cilibrasi, R. L. and Vitanyi, P. M. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19 (3): 370-383, 2007.
Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26 (3): 297-302, 1945.


Egozi, O., Markovitch, S., and Gabrilovich, E. Concept-based information retrieval using explicit semantic analysis. ACM Transactions on Information Systems (TOIS) 29 (2): 8, 2011.
Ehsani, R. and Drabløs, F. TopoICSim: a new semantic similarity measure based on gene ontology. BMC Bioinformatics 17 (1): 296, 2016.
Firth, J. R. A synopsis of linguistic theory, 1930-1955, 1957.
Friedman, B. G. Web Search Savvy: Strategies and Shortcuts for Online Research. Psychology Press, 2004.
Gabrilovich, E. and Markovitch, S. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI. Vol. 7. pp. 1606-1611, 2007.
Gangemi, A., Navigli, R., and Velardi, P. The OntoWordNet project: extension and axiomatization of conceptual relations in WordNet. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems". Springer, pp. 820-838, 2003.
Greub, W. H. Linear Algebra: 3d Ed. Springer, 1967.
Hirst, G., St-Onge, D., et al. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database vol. 305, pp. 305-332, 1998.
Islam, A. and Inkpen, D. Second order co-occurrence PMI for determining the semantic similarity of words. In Proc. of the International Conference on Language Resources and Evaluation, Genoa, Italy. pp. 1033-1038, 2006.
Islam, A. and Inkpen, D. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD) 2 (2): 10, 2008.
Jaccard, P. Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz, 1901.
Jiang, J. J. and Conrath, D. W. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008, 1997.
KG, S. and Sadasivam, G. S. Modified heuristic similarity measure for personalization using collaborative filtering technique. Appl. Math 11 (1): 307-315, 2017.
Kolb, P. Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA'09, 2009.
Krause, E. F. Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Courier Corporation, 2012.
Landauer, T. K. and Dumais, S. T. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104 (2): 211, 1997.
Leacock, C. and Chodorow, M. Combining local context and WordNet sense similarity for word sense identification. WordNet: An Electronic Lexical Database, 1998.
Lin, D. Extracting collocations from text corpora. In First Workshop on Computational Terminology. Citeseer, pp. 57-63, 1998.
Lund, K. and Burgess, C. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers 28 (2): 203-208, 1996.
Lund, K., Burgess, C., and Atchley, R. A. Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th Annual Conference of the Cognitive Science Society. Vol. 17. pp. 660-665, 1995.
Manning, C. et al. Foundations of Statistical Natural Language Processing. Vol. 999. MIT Press, 1999.
Matveeva, I., Levow, G.-A., Farahat, A., and Royer, C. Generalized latent semantic analysis for term representation. In Proc. of RANLP, 2005.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3 (4): 235-244, 1990.
Osgood, C., Suci, G., and Tannenbaum, P. The Measurement of Meaning. University of Illinois Press, 1964.
Patwardhan, S. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Ph.D. thesis, University of Minnesota, Duluth, 2003.
Patwardhan, S., Banerjee, S., and Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, pp. 241-257, 2003.
Potthast, M., Stein, B., and Anderka, M. A Wikipedia-based multilingual retrieval model. In European Conference on Information Retrieval. Springer, pp. 522-530, 2008.
Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1. IJCAI'95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 448-453, 1995.
Salton, G. Automatic Text Processing. Addison-Wesley, Reading, Massachusetts, vol. 4, 1989.
Turney, P. Mining the web for synonyms: PMI-IR versus LSA on TOEFL, 2001.
Ukkonen, E. A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica 5 (1): 313-323, 1990.
Wu, Z. and Palmer, M. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 133-138, 1994.


Parameter Learning in ProbLog with Probabilistic Rules

Francisco H. O. V. de Faria, Arthur C. Gusmão, Denis D. Mauá, Glauber De Bona, and Fabio G. Cozman

Escola Politécnica and Instituto de Matemática e Estatística, Universidade de São Paulo, Brazil

Abstract. Probabilistic logic programs under the distribution semantics offer a flexible language to specify deterministic rules and probabilistic assessments. State-of-the-art inference and learning algorithms are now available in the freely available ProbLog package. In this paper we describe techniques that speed up the learning algorithms in ProbLog, both for complete and incomplete datasets, by exploring a new approach for likelihood maximization. Our experiments show that our techniques significantly speed up the learning process.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning

Keywords: parameter learning, probabilistic logic programming, ProbLog

F. H. O. V. de Faria is supported by a scholarship from Toshiba Corporation. A. C. Gusmão is supported by a scholarship from CNPq. G. De Bona is supported by Fapesp, Grant 2016/25928-4. F. G. Cozman and D. D. Mauá are partially supported by CNPq.
Copyright © 2017. Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação.

1. INTRODUCTION

There is a myriad of ways to combine first-order logic and probability, thus allowing reasoning about relational structures while handling uncertainty. Examples are relational Bayesian networks [Jaeger 1997], Markov logic networks [Richardson and Domingos 2006] and a variety of probabilistic logics [Halpern 1990; Ognjanovic and Rašković 2000]. A well-explored path is to endow logic programs with probabilities. In fact, there are many proposals to extend the logic programming framework of Prolog with probabilistic semantics [Getoor and Taskar 2007; De Raedt et al. 2008]. A particularly popular semantics for probabilistic logic programs is due to Sato, and is usually referred to as the distribution semantics. The idea is that we have rules, as in logic programming, such as

    cares(X, Y) :− person(X), person(Y), neighbor(X, Y).

and probabilistic facts such as

    0.8 :: neighbor(X, Y),                                                (1)

meaning that the probability that any X and Y are neighbors is 0.8.

Usually, one is interested in assigning probabilities not only to facts, but also to rules. For instance, we may want to express that person(X) and person(Y) yield a proof for cares(X, Y) with probability 0.8. Then we could write

    0.8 :: cares(X, Y) :− person(X), person(Y).                           (2)

The syntax we just used can be found in the ProbLog package, a freely available system that implements state-of-the-art algorithms for inference and learning in the context of probabilistic logic programming [Fierens et al. 2015]. ProbLog adopts Sato's distribution semantics, together with probabilistic facts and probabilistic rules. Inference in ProbLog is done by model counting, and parameter learning relies on the Expectation-Maximization (EM) algorithm.


One important feature of ProbLog's EM-like algorithm is that it requires introducing a latent variable for each probabilistic rule in the program of interest. This is a major source of inefficiency, which we fix in this paper. In this work, we offer an alternative approach to learning parameters in probabilistic programs with probabilistic rules. Instead of inserting unobservable variables, we exploit the intrinsic semantics of probabilistic rules to express the likelihood of the observations as a function of the parameters, which is the main ingredient of parameter learning. The smaller number of latent variables speeds up the learning task, especially with complete data, when we can dispense with the EM algorithm altogether. This article is structured as follows: Section 2 presents ProbLog's probabilistic programs; Section 3 gives an overview of how parameter learning is implemented in ProbLog; our approach to parameter learning with probabilistic rules is put forward in Section 4; and in Section 5 we present the results of experiments comparing the performance of our algorithm to ProbLog's implementation.

2. PRELIMINARIES

We follow the syntax and semantics of probabilistic logic programs from the ProbLog framework [De Raedt et al. 2007; Fierens et al. 2015].

2.1 Syntax

Consider a vocabulary with logical variables X, Y, ..., predicate symbols r, s, ..., and constants a, b, .... Each predicate symbol has an associated arity. An atom is an expression of the form r(t1, t2, ..., tm) where r is a predicate symbol with arity m, and each ti is a term, which is either a constant or a logical variable. An atom is ground if it has no logical variables. An atomic proposition is a 0-arity predicate symbol, which is also a ground atom. A grounding is a function taking logical variables and returning constants. A literal is an atom (A) possibly preceded by not (not A). A (deterministic) rule is an expression of the form H :− B1, ..., Bn., where H is an atom, called the head, and each subgoal Bi is a literal, with B1, ..., Bn being the rule's body. A fact is a rule with empty body (H :− .). If an atom is in the head of some rule, it is said to be a derived atom. A set of rules is a logic program. The grounding of a rule is a ground rule obtained by applying the same grounding to each atom. The grounding of a program is the propositional program obtained by applying every possible grounding to each rule and fact, using only the constants in the program.

For a given logic program, the dependency graph contains its ground atoms as nodes, and it contains an arc ⟨A, B⟩ only if there is a ground rule with A in the body, possibly negated, and B in the head. A logic program is acyclic if its dependency graph is acyclic.

ProbLog programs are formed by standard logic programs together with probabilistic facts, which have the form p :: F., where p ∈ [0, 1] is a real number and F is an atom, called probabilistic. Similarly to rules, a probabilistic fact can be grounded to form a set of ground probabilistic facts. Formally, we define a probabilistic program as a pair ⟨P, PF⟩, where P is a logic program, PF is a set of probabilistic facts, and probabilistic atoms are not derived in P.

2.2 Semantics

The semantics of a relational probabilistic program is simply defined through the semantics of its grounding, so we focus on the propositional case. ProbLog's semantics for probabilistic programs is based on the standard semantics of Prolog. It is convenient to view a ground atom A in a program as a random variable taking values in {0, 1} (false and true), and we write P |= A = 1 (P |= A = 0) iff A (respectively, not A) is entailed by the logic program P.

Let T = ⟨P, PF⟩ be a probabilistic program, and let {pi :: Fi . | 1 ≤ i ≤ n} be the grounding of PF, for p1, ..., pn ∈ [0, 1] and a set of ground atoms F = {F1, ..., Fn}. T implicitly defines a probability distribution over logic programs,

    PPF(P ∪ F') = ∏_{Fi ∈ F'} pi · ∏_{Fi ∈ F\F'} (1 − pi),

where F' is a subset of F. For the given T = ⟨P, PF⟩, we can define the probability PT of a given set of ground atoms Q = {Q1, ..., Qn} having the truth values q = ⟨q1, ..., qn⟩ ∈ {0, 1}^n (Q = q) by employing PPF:

    PT(Q = q) = Σ { PPF(P ∪ F') | P ∪ F' |= Q = q }.                      (3)

Given some evidence E = e, which is a set of ground atoms (E) and their observed values (e), the probability of Q = q becomes the conditional probability PT(Q = q | E = e) = PT(Q = q, E = e)/PT(E = e), as usual. If Q is a set of random variables, P(Q) denotes its probability distribution.

Example 2.1. Consider the following probabilistic program

0.2 :: Burglary. 0.3 :: Fire. Alarm :− Burglary. Alarm :− Fire. Suppose we want to compute PT (Alarm = 1). With two probabilistic facts, there are four total choices F 0 ⊆ {Burglary, Fire}. Taking P = {Alarm :− Burglary., Alarm :− Fire.}, P ∪ F 0 |= Alarm = 1 for any non-empty F 0 . Hence, PT (Alarm = 1) = 0.2 × 0.3 + 0.2 × 0.7 + 0.8 × 0.3 = 0.44. 2.3

Adding Probabilistic Rules

We can augment the syntax and semantics of probabilistic programs to allow for probabilistic rules, like

p :: H :− B1 , . . . , Bn .,

where probabilities are annotated to deterministic rules with non-empty

T = hP, PRi, where P is a logic PR is a set of probabilistic rules. Grounding the probabilistic rules, one would have a set {pi :: Ri .|1 ≤ i ≤ n}, where R = {R1 , . . . , Rn } is a set of ground (deterministic) rules. Then 0 0 the probability of a total choice R ⊆ R entails the probability of a logic program PPR ((P ∪ R )) = Q Q pi (1 − pi ), which analogously denes a probability of a set of ground atoms Q have truth bodies.

In this case, a probabilistic program would be a pair

program and

Ri ∈R0

value

q:

Ri ∈R\R0

PT (Q = q) =

X {PPR (P ∪ F 0 ) | P ∪ F 0 |= Q = q} .

(4)

Using only probabilistic facts though, one can simulate probabilistic rules as well.

Each prob-

p :: H :− B1 , . . . , Bn . is equivalent to a pair formed by a deterministic rule H :− B1 , . . . , Bn , prob(id). and a probabilistic fact p :: prob(id)., where id is an identier corresponding to this rule, and prob(id) does not occur anywhere else [Fierens et al. 2015]. Due to this equivalence,

abilistic rule

ProbLog internally works only with probabilistic facts, not probabilistic rules, implicitly transforming each

p :: H :− B1 , . . . , Bn .

into

H :− B1 , . . . , Bn , prob(id).

and

p :: prob(id)..

When we refer to

a probabilistic program, we mean the more economic denition notion without probabilistic rules, unless stated otherwise, knowing that probabilistic rules can be simply seen as syntactical sugar.

3.

PARAMETER LEARNING IN PROBLOG

Here we take parameter learning to be the the task of, given some training data, nding the maximum-

hP, PFi, where both P and the atoms of PF (the At(T ) be the set of all ground atoms from a probabilistic = e denotes a set of ground atoms E ⊆ At(T ) together

likelihood probabilities for a probabilistic program

structure ) are xed. Formally, let program T = hP, PFi. An observation E program

with their truth value

e ∈ {0, 1}|E| ,

and a

dataset is a set of observations. The parameter learning

problem is dened via its input and output:

P and a set of facts {F1 , . . . , Fn } (the structure of a probabilistic program Tp = hP, PFi where the set of probabilistic facts is PF = {pi :: Fi , 1 ≤ i ≤ n} and the parameters are p = hp1 , . . . , pn i); (ii) a dataset D = {E1 = e1 , . . . , Em = em } (the training examples);

Input: (i) a logic program

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

29

5th KDMiLe – Proceedings

4

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

F. H. O. V. de Faria and A. C. Gusmão and G. De Bona and D. D. Mauá and F. G. Cozman

Output: the maximum-likelihood probabilities

pˆ = hˆ p1 , . . . , pˆn i,

pˆ = argmaxp PTp (D) = argmaxp

m Y

where

PTp (Ei = ei ) .

(5)

i=1 In the following, we briey sketch the algorithms implemented in ProbLog to tackle the parameter learning problem, which are implemented in its function LFI (after Learn From Interpretations) [Fierens et al. 2015].

3.1

Complete Data (Full Observability)

When an observation

E=e

is such that

E = At(Tp ),

we say it is

complete. A dataset

D

is complete

if each of its observation is so. In such case, the maximum-likelihood parameters can be computed straightforwardly via counting.

pi :: Fi ∈ PF that can PF are not derived in P. Consequently, the probability of the ground probabilistic atom Fij being true, PT (Fij = 1), is exactly the probability associated to the probabilistic fact pi :: Fi . ∈ PF  that is, PT (Fij = 1) = pi . Hence, the parameters p ˆ that maximize likelihood for a given complete dataset can be computed simply through the ratio of the groundings of the probabilistic fact pi :: Fi observed to be true. Consider a probabilistic program

be grounded to form

pi :: Fij ..

T = hP, PFi,

with a probabilistic fact

By denition, probabilistic atoms appearing in

F = {F1 , . . . , Fn } be the set of probabilistic facts, and let {Fij | 1 ≤ j ≤ Zi } be the set Fi , for each 1 ≤ i ≤ n. For a complete dataset D = {E1 = e1 , . . . , Em = em }, maximum-likelihood parameters p ˆ = hˆ p1 , . . . , pˆn i are given by, for 1 ≤ i ≤ n:  m Zi 1 XX 1 if Fij ∈ Ek ; k k δi,j , where δi,j = (6) pˆi = 0 otherwise. mZi

Formally, let

of possible groundings of the

k=1 j=1

mZi is the number of groundings of the probabilistic atom Fi observed D. In the propositional case, Zi = 1 for every 1 ≤ i ≤ n. Typically, learning parameters for a relational probabilistic program is possible with a single observation D = {E = e}, as a large Zi >> 1 guarantees many observable groundings for the same probabilistic fact pi :: Fi . The normalization factor

through the whole dataset

3.2

Incomplete Data (Partial Observability)

In practice, learning with incomplete data is a common scenario. As the direct counting approach 1

from the previous section is not an option when there is unobserved ground probabilistic atoms , the Expectation-Maximization (EM) algorithm [Dempster et al. 1977] has been the main tool to learn parameters in this situation. The idea behind the EM algorithm implemented in ProbLog is: (E-step) for each observation

Ek ,

use a set of parameters

pt

to compute the probability of each unobserved

k δi,j (from Equation (6)); (the k t+1 M-step) employs these expected values and the observed δi,j to obtain new parameters p via Equation (6). ground fact

Fij

being true given

Ek ,

which is the expected value of

Formally, ProbLog's parameter learning function takes as input a logic program P, a set of facts {F1 , . . . , Fn } and a dataset D = {E1 = e1 , . . . , Em = em }. Let Fi1 , . . . , FiZi denote the groundings of the fact Fi , for each i, and let Tp = hP, {p1 :: F1 , . . . , pn :: Fn }i denote the probabilistic program determined by the parameters hp1 , . . . , pn i. After setting the value of each pi ∈ [0, 1] randomly, the

LFI-ProbLog algorithm ([Fierens et al. 2015]) iterates between the two steps below, until the likelihood (PTp (D)) increment is less than a threshold:

1 With

deterministic rules, observing the probabilistic atoms determine the truth value of derived atoms.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

30

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Parameter Learning for ProbLog with Probabilistic Rules (E-step): for each (M-step): for each

When

Fij

is in

hi, j, ki

s.t.

k 1 ≤ i ≤ n, 1 ≤ j ≤ Zi , 1 ≤ j ≤ m, δi,j = PTp (Fij = 1|Ek = ek );

1 ≤ i ≤ n, pti =

1 mZi

Zi m P P

k=1 j=1

Ek , PTp (Fij = 1|Ek = ek )

5

·

k δi,j .

in the E-step is simply

1

or

0,

and there is no need for

performing inference. In all other cases ProbLog must perform an inference for

PTp (Fij = 1|Ek = ek ), Fij is independent

although some optimizations are possible. For instance, ProbLog can detect when from

Ek ,

3.3

Handling Probabilistic Rules

yielding

PTp (Fij = 1|Ek = ek ) = pi .

For more optimization details, see [Fierens et al. 2015].

Tp , whose parameters we want to learn from a dataset D, has a probabilisp :: H :− B1 , . . . , Bn ., ProbLog interprets it as H :− B1 , . . . , Bn , prob(id). and p :: prob(id).. Hence, the introduction of the new probabilistic atom prob(id) makes any observation incomplete. In other words, a probabilistic rule p :: H :− B1 , . . . , Bn . is inserted in the probabilistic program whose

If the probabilistic program tic rule

parameters are to be learned, ProbLog applies the algorithm sketched above no matter the dataset

D,

due to the fresh, unobserved atoms

prob(id).

Each ground probabilistic rule is responsible for one

of such new atoms and the larger the number of latent atoms, the slower is the convergence. To circumvent that, the user himself could input probabilistic rules translated into

prob(id)

H :− B1 , . . . , Bn , prob(id).

and

p :: prob(id).,

already

within the input observations. Nonetheless, this approach is inviable mainly due to the fact

that these atoms are essentially not observable. For instance, if

prob(id). probabilistic rules, p1 :: H :− B1 . and p2 :: H :− B2 , share true, then there is no means to tell which prob(id) is true. also false and there is no way to tell the value of

4.

p :: H :− B1 , . . . , Bn .

giving also the truth value of each

B1

is false in an observation,

H

is

Furthermore, it may the case that two the same head, and if

H, B1 , B2

are all

OUR APPROACH

In short:

there is no need for inserting an auxiliary atom for each probabilistic rule to perform

Tp = hP, PRi with probabilistic rules PR = {pi :: Ri . | 1 ≤ i ≤ n}, we can adopt the augmented semantics from Section 2.3. Thus, one can compute the likelihood of Tp for an observation E = e directly through the expression for PTp (E = e) given in Equation (4). This expression is a function of the parameters pi , and its maximum yields the solution parameter learning. Given a probabilistic program

to the parameter learning problem. Henceforth, we adapt the learning problem to allow probabilistic its input is a logic program P, a set of rules {R1 , . . . , Rn } and a dataset D = {Ei = ei | 1 ≤ i ≤ m}; and its output is the parameter vector hp1 , . . . , pm i yielding the probabilistic program Tp = hP, PRi, with PR = {pi :: Ri | 1 ≤ i ≤ n}, that maximizes PTp (D) (the likelihood).

rules:

4.1

Complete Data

When data are complete, we can write down the (log-)likelihood and maximize it directly. Let

a1 , . . . , Ao = ao } (E = e)

PTp (E)

be a complete observation.

{A1 =

Using the dependency graph, we can factor

o Q

PTP (Ai = ai |P a(Ai )), i=1 is the set of ground atoms that are parents of Ai in the dependency graph 

in the usual way, employing the Markov condition:

PTp (E = e) =

P a(Ai ) ⊆ E {Ai } ∪ P a(Ai ) is Ai 's family. When considering a dataset D = {E1 = eq , . . . , Em = em }, m Q that PTp (D) = PTp (Ek = ek ), and each PTp (Ek = ek ) can be factored in this way. where

we have

k=1

Each of the terms

PTp (Ai = ai |P a(Ai ))

in

PTp (D)

is a function depending only on the parameters

attached to rules sharing as head a same ground atom

Ai .

Besides that, if

Aj

and

Ai

are groundings

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

31

5th KDMiLe – Proceedings

6

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

F. H. O. V. de Faria and A. C. Gusmão and G. De Bona and D. D. Mauá and F. G. Cozman

PTp (Ai |P a(Ai )) and PTp (Aj |P a(Aj )) share the same parameters. If a set of ground A = {A1 , . . . , An } share the same predicate symbol r, we call the set A ∪ {P a(Ai ) | Ai ∈ A} the (predicate) family of r . Using predicate families, we can partition the factors PTp (Ai = ai |P a(Ai )) of PTp (D), and each partition can be maximized independently due to the disjoint parameters. of the same atom,

atoms

Tp where the predicate r0 (·) appears in the head p1 :: r0 (X) :− r1 (X), not r2 (X). and p2 :: r0 (X) :− not r1 (X), r2 (X)., and a complete observation E = e. The family of r0 (·) include all groundings of r0 (·), r1 (·) and r2 (·). For each possible instantiation of X , let r(X) = hr0 (X), r1 (X), r2 (X)i denote a set of ground atoms. Suppose that, for a0 dierent values of X , r(X) = h1, 1, 0i ∈ E , for b1 values of X , r(X) = h0, 1, 0i ∈ E , for b0 values of X , r(X) = h1, 0, 1i ∈ E and for b1 values of X , r(X) = h0, 0, 1i ∈ E . The a0 b a b0 likelihood PTp (E = e) then has a factor p1 (1 − p1 ) 1 p2 (1 − p2 ) 1 (from r0 's family), maximized at p1 = a0 /(a0 +a1 ) and p2 = b0 /(b0 +b1 )  and these are part of the hp1 , . . . , pn i maximizing PTp (E = e). For instance, consider a probabilistic program

of exactly two rules,

We noticed that several other combinations of rules sharing the same head lead to likelihood factors whose maximizations have exact solutions. In our implementations we used exact solutions to nd maximum likelihood parameters whenever possible.

When the likelihood expressions cannot

be maximized in a closed form, we resort to a gradient-based numerical optimization to nd the maximum-likelihood parameters

hp1 , . . . , pn i.

In comparison to ProbLog's method, our approach dispenses with the EM-like algorithm when the data are complete, even with probabilistic rules. Consequently, the whole learning problem is solved in a single numerical optimization for each predicate family.

4.2

Incomplete Data

When the data are incomplete, we employ an EM algorithm, although we can also avoid inserting an auxiliary variable for each ground probabilistic rule.

Tp = hP, PRi

Suppose we have a probabilistic program

E = e. Let Z = At(Tp ) \ E be the set of unobserved ground atoms. For each complete observation E = e, Z = z , we can express the (log-)likelihood in terms of the parameters p = hp1 , . . . , pn i, as explained in the section above. Thus, a probability distribution over Z yields an expected value for the log-likelihood. As an observation E = e determines a conditional probability distribution for Z , it also determines an expected log-likelihood. When considering a dataset D = {E1 = e1 , . . . , Em = em }, the expected log-likelihood can be computed by summing over the observations. Setting initially pi = 0.5 for each parameter pi , our implementation and an incomplete observation

of the EM algorithm iterates between the following steps until some convergence criterion is met:

(E-step): Given a set of parameters

z ∈ {0, 1}|Z| ;

p, for each Ek = ek ∈ D, compute PTp (Z = z|Ek = ek ) for each

(M-step): Find the set of parameters maximizing the expected log-likelihood:

argmaxp0

m X

X

PTp (Z = z|Ek = ek ) ln(PTp0 (Ek = ek , Z = z)).

(7)

k=1 z∈{0,1}|Z|

To compute the terms

PTp (Z = z|Ek = ek ),

we employ ProbLog's inference. In principle, for each

F(Ai ) denote Ai . If c ∈ {0, 1}|Q| is a vector of truth values associated to a set Q of |Q0 | ground atoms (Q = c), we use cQ0 ∈ {0, 1} to denote the vector of truth values corresponding to 0 the subset Q ⊆ Q. Now the expected log-likelihood, for a each Ek = ek , can be rewritten as X X PTp (F(Ai ) = c|Ek = ek ) ln(PTp0 (Ai = cAi |P a(Ai ) = cP a(Ai ) )). (8) Ek = ek ,

each

z

would yield an inference in the E-step, but we can do much better. Let

the family of a ground atom

Ai ∈At(Tp ) c∈{0,1}|F (Ai )|

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

32

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Parameter Learning for ProbLog with Probabilistic Rules

·

7

Thus, assuming the observations are consistent with the model, we only need to make inferences

PTp (F(Ai ) = c|Ek = ek )

for those

c

such that

PTp0 (Ai = cAi |P a(Ai ) = cP a(Ai ) )

is a (non-constant)

function of the parameters, for the others can be ignored during the maximization. suppose that the ground atom

H

is the head of a single ground probabilistic rule

For instance,

p :: H :− B1 , B2 ..

c regarding the H 's family {H, B1 , B2 }. Nonetheless, we have to c = h0, 1, 1i and c = h1, 1, 1i, for the other values of c either yield a constant PTp0 (H = cH |P a(H) = cP a(H) ) = 1 or PTp (F(H) = c|Ek = ek ) = 0.

There are eight possible values for consider only two of them,

5.

EXPERIMENTS AND RESULTS

The goal of the experiments is to compare our approach with LFI-ProbLog algorithm for varying missing data rates and dataset sizes. To accomplish this, we used the following program:

0.3 :: fire(X) :− person(X). 0.4 :: burglary(X) :− person(X). 0.7 :: alarm(X) :− fire(X).

0.9 :: alarm(X) :− burglary(X). 0.8 :: cares(X, Y) :− person(X), person(Y). 0.8 :: calls(X, Y) :− cares(X, Y), alarm(Y), not samePerson(X, Y).

To generate the datasets we sampled from the model above for a given number of constants (dataset size). For each constant

c deterministic facts person(c) and samePerson(c, c) were added to the model.

To impose a missing rate we discarded part of the generated observations using a pseudo random function. All tests were run four times, each time with a new independently sampled dataset. Each datapoint in Table 1 therefore corresponds to the average computation time among these four runs. ProbLog's stopping criteria is dened over the convergence of the log-likelihood values observed. Notice, however, as ProbLog's iterative process diers substantially from our algorithm's, using the same stopping criteria does not guarantee similar log-likelihood values are reached. In order to ensure

× 10−3 , which means it stops −3 when the log-likelihood variation between subsequent iterations is smaller than 1 × 10 and set our a valid comparison basis, we set a xed Problog's stopping criteria1

algorithm to stop whenever it reaches an equal or better log-likelihood value than ProbLog for the same dataset.

All experiments were performed in parallel on a dedicated machine with the following specications: 8 vCPU, 2900 MHz, 15 GiB RAM. From the results we can see that our algorithm outperforms ProbLog in the vast majority of cases, only losing for datasets with 5 constants and missing rate above or equal to 20%. It is also worth noting that, when the size of the datasets increases, the ratio between our approach and ProbLog's computation time tends to decrease. Dataset size is limited to 25 constants because for larger datasets ProbLog approximated the likelihood values to zero, returning an error when trying to calculate its log.

6.

CONCLUSION

We have presented a new approach to learn the parameters for a probabilistic program with probabilistic rules. We have shown how one can avoid the insertion of latent variables for probabilistic rules. In particular, this avoids the need for using the EM-algorithm when the data are complete. Experiments indicates signicant gains in time, when comparing to ProbLog, even when there is missing data. Future work includes applying these techniques to perform structure learning; that is, learning the rules of a probabilistic program.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

33

5th KDMiLe – Proceedings

8

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

F. H. O. V. de Faria and A. C. Gusmão and G. De Bona and D. D. Mauá and F. G. Cozman

·

[Figure 1 comprises four panels, (a) 0% Missing Rate, (b) 10% Missing Rate, (c) 20% Missing Rate and (d) 30% Missing Rate, each plotting time to finish (seconds) against size of dataset.]

Fig. 1: Time to learn parameters from incomplete relational data. Size of dataset is the number of constants in the program. Solid squares were generated by ProbLog; empty circles were generated by our algorithm.

REFERENCES

De Raedt, L., Frasconi, P., Kersting, K., and Muggleton, S. H. Probabilistic Inductive Logic Programming. Springer, 2008.
De Raedt, L., Kimmig, A., and Toivonen, H. ProbLog: A probabilistic Prolog and its application in link discovery, 2007.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1977.
Fierens, D., Van den Broeck, G., Renkens, J., Shterionov, D., Gutmann, B., Thon, I., Janssens, G., and De Raedt, L. Inference and learning in probabilistic logic programs using weighted Boolean formulas. Theory and Practice of Logic Programming 15 (3): 358-401, 2015.
Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. MIT Press, 2007.
Halpern, J. Y. An analysis of first-order logics of probability. Artificial Intelligence 46 (3): 311-350, 1990.
Jaeger, M. Relational Bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 266-273, 1997.
Ognjanovic, Z. and Rašković, M. Some first-order probability logics. Theoretical Computer Science 247 (1): 191-212, 2000.
Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (1): 107-136, 2006.


A Dispersion-Based Discretization Method for Models Explanation

Bernardo Stearns², Fabio Rangel¹,², Fabrício Firmino², Jonice Oliveira¹,²

¹ Departamento de Ciência da Computação, Universidade Federal do Rio de Janeiro, Brazil
² Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Brazil

Abstract. Machine learning algorithms are mainly focused on generalization (e.g., predicting unseen data in a classification problem). However, a new perspective on their effectiveness is the capability of explaining their predictions. By binary-encoding the features it is possible to provide an intuitive interpretation, but this may lead to loss of information. This work explores this trade-off: a discretization method based on the dispersion of the numerical features is proposed, and its effect on the accuracy of a machine learning method is evaluated. As a result, the experiment showed that it is possible to achieve good performance even after the discretization.

Categories and Subject Descriptors: [Theory of computation]: Machine learning theory; [Theory of computation]: Data modeling; [Computing methodologies]: Supervised learning by classification

Keywords: Discretization Methods, Supervised Learning, Model Explanation, Model Interpretability

1. INTRODUCTION

Model explanation is the problem of understanding the reasons behind a model's predictions [Ribeiro et al. 2016]. Explanation is important for application domains that require decision making based on model predictions: the motivation behind a prediction helps to assess whether a model is trustworthy and, in addition, enriches the information available to the decision-maker. Moreover, explainable machine learning algorithms shorten the gap between data modelling and knowledge extraction [Vellido et al. 2012]. For instance, in medical domains interpretable machine learning is desired, since knowledge can be created from the explanation of the predictions. There are interpretable models in machine learning, such as Decision Trees [Andrzejak et al. 2013]; however, the overfitting problem in Decision Trees is usually addressed using ensembles, which can hinder interpretability. Locally Interpretable Model-Agnostic Explanation (LIME) is a framework proposed by [Ribeiro et al. 2016] that aims to interpret the predictions of any supervised algorithm. LIME does so by learning an interpretable model locally around the prediction. However, LIME needs a representation of the features in a discrete, binary space. Discretization is a process that converts attributes from a continuous space into a discrete space [Chmielewski and Grzymala-Busse 1996]. This is an important process for machine learning algorithms, since many of them work in a discrete space (e.g., ID3 and Weightless Neural Networks), while real-world tasks usually present data in a continuous space [Dougherty et al. 1995]. Distinct discretization (binarization) methods may result in different interpretations for the decision-maker.

Copyright © 2017. Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação.


Besides that, it is important for LIME to reduce the loss of information when binarizing the features. This loss can be evaluated by comparing the prediction results of a simple classifier; Support Vector Machines (SVM) [Vapnik 1995] are suitable for this task, since they perform well with both continuous and discrete spaces. This work proposes an unsupervised discretization method based on the dispersion around the mean value in order to obtain a binary representation of continuous features. The method uses the thermometer concept presented in [Kappaun et al. 2016]. The idea is to map the feature values to a binary dispersion (normalized) space. This could help the LIME framework in reasoning about a prediction by telling how far a feature value is from the mean, positively or negatively. To evaluate the discretization, we compared the accuracy of an SVM (before and after the discretization) using k-fold cross-validation over 3 datasets.

The remainder of the paper is organized as follows. Section 2 presents the proposed discretization. Section 3 describes the experimental details, such as the datasets used, the Support Vector Machines classifier, validation, and metrics. Section 4 shows the results and a discussion. The conclusion is presented in Section 5.

2. DISPERSION DISCRETIZATION (DD) METHOD

To understand DD, one must first understand what a thermometer representation is. The name thermometer is used in [Kappaun et al. 2016], but the idea came from [Carneiro et al. 2015], where it was used to discretize probabilities. A thermometer is a representation of many thresholds: consider a vector of length d where each dimension represents a range of possible values of a feature; a vector dimension is set to one if the feature value is lower than the corresponding threshold value. Figure 1 shows a thermometer encoding for a specific feature whose values range over [0, 1].

Fig. 1. Thermometer binary encoding example.

The main idea behind DD is to build a thermometer encoding in which the thresholds depend on the dispersion of the observations around the mean value. Furthermore, the zero position is marked in the middle of the thermometer, reflecting the symmetry of the problem. For this, we first need to map the observations to a normalized space using Equation 1, a process also known as standardization. In Equation 1, μ̂ is the average value of the observations and σ̂ is the sample standard deviation.

    z = (x − μ̂) / σ̂                                                       (1)

After this data transformation, it is possible to discretize the feature value by defining thresholds in the dispersion space. Thus, the binary vector represents the distance to the mean value measured in multiples of the standard deviation. The scale is a parameter of the discretization method that makes the number of intervals flexible. Note that the last vector position cannot have an upper boundary, so it includes all values greater than the previous position's upper boundary; the same holds for the first position of the vector, since the encoding is symmetric.

Fig. 2. Dispersion Discretization Method explained.
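As an illustration of the encoding just described, the sketch below standardizes a value (Equation 1) and then emits thermometer bits over thresholds placed at multiples of the standard deviation around the mean; the function and parameter names (scale, width) are our own assumptions, not the authors' implementation:

import numpy as np

def dispersion_thermometer(x, values, scale=1.0, width=2):
    # Thresholds at -width*scale, ..., -scale, 0, scale, ..., width*scale
    # standard deviations around the mean (the zero position sits in the middle).
    mu = np.mean(values)
    sigma = np.std(values, ddof=1)
    z = (x - mu) / sigma                      # Equation (1): standardization
    thresholds = np.arange(-width, width + 1) * scale
    # Thermometer bit: 1 whenever the standardized value is below the threshold.
    return (z < thresholds).astype(int)

sample = [8.0, 9.5, 10.0, 10.5, 12.0]
print(dispersion_thermometer(11.2, sample))   # [0 0 0 1 1]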

Figure 2 illustrates how DD works: DD maps the feature distribution to a multinomial distribution, using intervals based on sample standard deviations. We expect the interpretation of a prediction to take into account how far the observed value is from the mean value.

3. EXPERIMENTAL SETUP

The experiment consisted in empirically analyzing the behavior of the accuracy of the model as the width of the discretization method changed, which causes the number of intervals to change. To do so, 3 datasets were used: (i) German Credit Dataset; (ii) Australian Credit Approval; and (iii) Wine. All datasets were obtained from the UCI Machine Learning Repository¹. This section describes the datasets, the Support Vector Machines classifier, and the validation process.

3.1 Datasets

The datasets in the present work are commonly used for machine learning performance evaluation. For instance, German Credit Dataset and Australian Credit Dataset were used in [Eggermont et al. 2004] to evaluate the work methodology. Wine Dataset is a famous dataset which was already used to evaluate SVM classification performance [Bredensteiner and Bennett 1999] [Hsu and Lin 2002]. The main criterion of this work for choosing the datasets were the presence of numerical features for applying the suggested discretization method. An important measure of the datasets is the balance of the classes that are aimed to be predicted, because it establishes a lower bound for the classification accuracy. The datasets have respectivelly the class proportions presented in Table I.

Table I. Datasets class proportions and accuracy baseline.

Dataset                      Class A   Class B   Class C   Accuracy Baseline
Australian Credit Dataset    383       307       -         0.5550
German Credit Dataset        700       300       -         0.7000
Wine Dataset                 71        59        48        0.3988

1 http://archive.ics.uci.edu/ml


3.1.1 Australian Credit Dataset. The Australian dataset consists of credit card applications, with 690 instances and 14 features: 6 continuous and 8 categorical.

3.1.2 German Credit Dataset. This dataset categorizes people described by a set of attributes as good or bad credit risks. It consists of 1000 instances and 20 features: 7 continuous and 13 categorical.

3.1.3 Wine Dataset. The Wine Dataset contains chemical analyses of wines produced in the same region of Italy but derived from three different crops. It is composed of 178 instances and 13 features: 7 continuous and 6 categorical.

3.2 One-Hot Encoding

Although the present work proposes a specific preprocessing technique to transform numerical features into binary categorical features, the datasets also contain non-numerical features. To improve the classifier performance on these kinds of features, a One-Hot Encoding technique is applied, which maps each categorical feature into numerical ones: every possible value (category) of the feature becomes a new binary feature, whose bit is set to one whenever the original feature assumes the associated category.
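As a small, hedged illustration of this encoding (the column name is hypothetical), pandas can produce the binary columns directly:

```python
import pandas as pd

# One binary column per category; the bit is 1 when the row assumes that category.
df = pd.DataFrame({"purpose": ["car", "education", "car", "furniture"]})
encoded = pd.get_dummies(df, columns=["purpose"])
print(encoded)
```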

3.3 Support Vector Machines (SVM)

Support Vector Machines [Vapnik 1995] is one of the most popular machine learning algorithms, able to perform both classification and regression. Its robustness and generalization capability are widely emphasized in the literature. SVM is also largely applied to text categorization problems, and [Joachims 1998] describes the reasons why SVM should succeed at text categorization.

Fig. 3. SVM optimization problem.

In a binary classification problem, the algorithm consists in finding the best hyperplane that linearly separates the data into classes. Finding the best hyperplane can be described as maximizing the distance to the margin points, as shown in Figure 3. Maximizing this distance is equivalent to maximizing $\frac{1}{\|w\|}$, where $w$ is the normal vector to the separating hyperplane, subject to:
$$|w^T x_i + b| \geq 1 \quad \forall i \in \{1, 2, ..., n\} \quad (2)$$


Equation 2 considers the classification problem in which the data points are i.i.d. observations that come in pairs $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$, and $b$ is the bias of the hyperplane. Considering that $Y \in \{-1, 1\}$, the constrained optimization problem can be rewritten as follows:
$$\underset{w,b}{\text{minimize}} \;\; \frac{1}{2} w^T w \qquad \text{subject to} \;\; y_i(w^T x_i + b) \geq 1 \quad \forall i \in \{1, 2, ..., n\} \quad (3)$$

By the Karush-Kuhn-Tucker conditions [Kuhn and Tucker 1951], it is possible to define the Lagrangian as:
$$L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right] \quad (4)$$

The Lagrange dual function for Equation 4 is:
$$g(\alpha) = \min_{w,b} L(w, b, \alpha) \quad (5)$$

Subsequently, the dual problem can be found by combining Equations 4, 6, and 7, resulting in Equation 8.
$$\nabla_w L = 0 \quad (6)$$
$$\frac{\partial L}{\partial b} = 0 \quad (7)$$
$$\underset{\alpha}{\text{maximize}} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad \text{subject to} \;\; \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; \alpha_i \geq 0 \quad (8)$$

SVM can also make use of a kernel function [Hsu et al. 2003] to perform the linear separation in a higher dimension. The present work used the Radial Basis Function (RBF) kernel, presented in Equation 9. It is important to mention that the SVM implementation used was the one available in the scikit-learn library [Pedregosa et al. 2011], and the hyper-parameters were set to their default values.
$$K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}, \quad \gamma > 0 \quad (9)$$

3.4 Validation

To evaluate SVM performance, k-Fold Cross Validation [Refaeilzadeh et al. 2009] with k = 10 was applied in order to compute the average accuracy. In 10-Fold Cross Validation, the dataset is split into 10 parts; in each round, one part is used as the test set while the other 9 are used for training. This process was applied for each discretization granularity (number of intervals).
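A minimal sketch of this evaluation setup follows, assuming scikit-learn; the bundled copy of the Wine dataset is used here only for illustration (the paper obtained its datasets directly from the UCI repository).

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# RBF-kernel SVM with scikit-learn's default hyper-parameters,
# evaluated with 10-fold cross validation (average accuracy).
X, y = load_wine(return_X_y=True)
clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```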


4. RESULTS AND DISCUSSION

The experiment sought to measure how discretizing the numerical features affects the accuracy of the support vector machine as the number of intervals changes. The best discretization would achieve both the highest accuracy and the lowest number of features. In this section we show how each parameter value of the discretization method affects the accuracy and the number of intervals. The number of intervals is determined by a factor λ that multiplies the sample standard deviation of the feature. λ is a number in the range (0, 1], and by searching for the best λ it was possible to visualize the model accuracy for many numbers of intervals. In Figure 4, λ was decreased geometrically, dividing it by 10 at each iteration. Increasing the number of intervals did not improve the accuracy. This is an interesting finding, since the highest number of intervals would give the best approximation of the feature distribution.
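The search over λ can be sketched as below; `evaluate(lam)` is a hypothetical placeholder for the full pipeline (discretize all continuous features with the given λ, train the SVM, and return the 10-fold cross-validation accuracy), so this is only an outline of the procedure, not the authors' code.

```python
import numpy as np

def search_lambda(evaluate, mode="geometric", steps=5):
    """Try a sequence of lambda values and report the accuracy for each."""
    if mode == "geometric":
        lambdas = [1.0 / (10 ** i) for i in range(steps)]     # 1, 0.1, 0.01, ...
    else:
        lambdas = list(np.linspace(1.0, 0.1, steps))           # linear decrease
    return {lam: evaluate(lam) for lam in lambdas}
```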

Fig. 4. Accuracy by number of thresholds when searching for λ by decreasing it geometrically.

The next step was to search for a better λ by decreasing it linearly; Figure 5 presents this search. For the Australian Credit Dataset, the accuracy tends to decrease as the number of intervals increases, although values higher than the continuous-feature accuracy were achieved. The German Credit Dataset could not provide an accuracy better than the baseline (Table I). Figure 5 also presents the results for the Wine Dataset: although the accuracies were below the continuous-feature accuracy, some observations came close to it. These results motivated a comparison of the continuous approach with the best discrete approach. A paired t-test was applied in order to statistically validate this comparison; the null hypothesis states that both models have the same average accuracy. Each average accuracy was calculated using the validation described in Section 3.4.

Fig. 5. Accuracy by number of thresholds when searching for λ by decreasing it linearly.


Table II presents the best accuracy of the model using DD and the accuracy of SVM on the original continuous features. Both accuracies were calculated with 10-Fold Cross Validation over the same observations. No p-value was low enough to reject the null hypothesis. This is interesting evidence that reinforces the idea that the applied discretization was successful.

Table II. Comparison of the best DD λ with the non-discrete approach.

Dataset                      Accuracy (Continuous)   Accuracy (DD)   p-value
Australian Credit Dataset    0.8695                  0.8608          0.8384
German Credit Dataset        0.7500                  0.7210          0.3931
Wine Dataset                 0.9888                  0.9722          0.3305
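For reference, a paired t-test such as the one used above can be computed with SciPy; the per-fold accuracies below are hypothetical, shown only to illustrate the call.

```python
from scipy import stats

# Hypothetical per-fold accuracies (paired by fold) for the two models.
acc_continuous = [0.87, 0.86, 0.88, 0.85, 0.87, 0.86, 0.88, 0.87, 0.86, 0.85]
acc_dd         = [0.86, 0.85, 0.87, 0.86, 0.86, 0.85, 0.87, 0.86, 0.85, 0.86]

t_stat, p_value = stats.ttest_rel(acc_continuous, acc_dd)
print(p_value)  # p > 0.05: fail to reject equal mean accuracies
```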

Table III presents the number of features after the application of DD. These numbers correspond to the best results presented in Table II.

Table III. Number of features after the application of DD.

Dataset                      Original Number of Features   Number of Features after DD
Australian Credit Dataset    35                            73
German Credit Dataset        49                            70
Wine Dataset                 14                            1174

5. CONCLUSION

This work proposed and evaluated a novel discretization method based on the sample standard deviation of a feature. Discretizing a feature makes it possible to enhance the interpretability of the prediction, and that possibility motivated this work. SVM was used to perform the classification task, and its accuracy was used to evaluate the discretization over 3 well-known datasets. The results showed that, even after discretizing the features, it is still possible to obtain good results, although it is important to understand that this process may lead to a loss of information. As future work, we intend to compare our method with other unsupervised discretization methods presented in [Garcia et al. 2013]. Also, the LIME framework will be used to explain the predictions, providing feedback on how far the value of a feature is from its mean value, when this is the reason for the prediction.

REFERENCES
Andrzejak, A., Langner, F., and Zabala, S. Interpretable models from distributed data via merging of decision trees. In Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on. IEEE, pp. 1–9, 2013.
Bredensteiner, E. J. and Bennett, K. P. Multicategory classification by support vector machines. Computational Optimization and Applications 12 (1): 53–80, 1999.
Carneiro, H. C., França, F. M., and Lima, P. M. Multilingual part-of-speech tagging with weightless neural networks. Neural Networks vol. 66, pp. 11–21, 2015.
Chmielewski, M. R. and Grzymala-Busse, J. W. Global discretization of continuous attributes as preprocessing for machine learning. International Journal of Approximate Reasoning 15 (4): 319–331, 1996.
Dougherty, J., Kohavi, R., Sahami, M., et al. Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the Twelfth International Conference. Vol. 12. pp. 194–202, 1995.
Eggermont, J., Kok, J. N., and Kosters, W. A. Genetic programming for data classification: Partitioning the search space. In Proceedings of the 2004 ACM Symposium on Applied Computing. ACM, pp. 1001–1005, 2004.


Garcia, S., Luengo, J., Sáez, J. A., Lopez, V., and Herrera, F. A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering 25 (4): 734–750, 2013.
Hsu, C.-W., Chang, C.-C., Lin, C.-J., et al. A practical guide to support vector classification, 2003.
Hsu, C.-W. and Lin, C.-J. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2): 415–425, 2002.
Joachims, T. Text categorization with Support Vector Machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol (Eds.), Machine Learning: ECML-98: 10th European Conference on Machine Learning, Chemnitz, Germany, April 21–23, 1998, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 137–142, 1998.
Kappaun, A., Camargo, K., Rangel, F., Firmino, F., Lima, P. M. V., and Oliveira, J. Evaluating binary encoding techniques for WiSARD. In Intelligent Systems (BRACIS), 2016 5th Brazilian Conference on. IEEE, pp. 103–108, 2016.
Kuhn, H. W. and Tucker, A. W. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, Calif., pp. 481–492, 1951.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research vol. 12, pp. 2825–2830, 2011.
Refaeilzadeh, P., Tang, L., and Liu, H. Cross-validation. In Encyclopedia of Database Systems. Springer, pp. 532–538, 2009.
Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 1135–1144, 2016.
Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
Vellido, A., Martín-Guerrero, J. D., and Lisboa, P. J. Making machine learning models interpretable. In ESANN. Vol. 12. pp. 163–172, 2012.


A Method for Link Prediction in Complex Networks Based on Graph Topology History

E. Florentino and R. Goldschmidt
Instituto Militar de Engenharia, Brazil
{erick.florentino, ronaldo.rgold}@ime.eb.br

Abstract. Link prediction, one of the most important tasks in complex network analysis, seeks to identify pairs of unconnected network elements that are likely to become connected in the future. Several methods have been developed to implement this task. In general, they are based on the up-to-date structure of the network and do not consider historical information about what the network topology looked like at the moments when the existing links were inserted into the structure. The present work is based on the hypothesis that recovering such information helps to build predictive models that are more accurate than the existing ones, since it enriches the description of the application context with examples that portray precisely the kind of event one wishes to predict: the appearance of new connections. This paper therefore aims to describe a link prediction method based on the evolution history of complex network topologies. Results obtained in a preliminary experiment revealed evidence of the adequacy of the proposed method.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning; I.5.3 [Clustering]: Algorithms, Similarity Measures; E.1 [Data Structures]: Graphs and Networks

Keywords: complex networks, data clustering, link prediction, machine learning, social network mining

1. INTRODUCTION

The analysis of complex networks¹ has received great attention from both academia and industry in recent years [Wang et al. 2015]. In general, this kind of analysis seeks to understand how the structures of these networks evolve over time [Dorogovtsev and Mendes 2002]. For example, trying to predict whether two vertices of a network will become connected in the future is an important complex network analysis task known as link prediction [Soares and Prudêncio 2013]. Several methods for predicting links currently exist [Wang et al. 2015]. Many of them follow the so-called unsupervised link prediction approach. The usual procedure common to the methods of this approach consists in applying similarity metrics to compute scores that express some kind of degree of compatibility² between the vertices of each unconnected pair (e.g. homophily, ties, degrees of separation, among others). Then, such vertex pairs are listed in decreasing order of score, and those at the top of the list (the top-N) are recommended as the ones containing the vertices most likely to become connected [Liben-Nowell and Kleinberg 2007]. Number of common neighbors, Jaccard coefficient, preferential attachment, and the Adamic-Adar index are typical examples of similarity metrics frequently used to compute scores [Wang et al. 2015].

¹ A complex network, in the context of this work, is a highly interconnected graph in which each vertex represents an item of the network (e.g. a person, web page, photo, company, group, etc.) and each edge represents some kind of interaction between the items it connects (e.g. friendship, collaboration, communication, etc.).
² A numeric value that summarily describes the properties shared by the two vertices.


In general, link prediction methods apply the similarity metrics over the complete and up-to-date topology (structure) of the network, with all existing edges [Wang et al. 2015]. They therefore do not consider information about what the network structure looked like at the instant each edge was created.³ The present work is based on the hypothesis that recovering such information can help to build predictive models that are more accurate than the existing ones, since it enriches the description of the application context with examples that portray precisely the kind of event one wishes to predict: the appearance of new edges. In view of this, this paper aims to present an unsupervised link prediction method based on historical information describing what the topology of a complex network looked like at the moment each edge was added to the network. Once recovered by the proposed method, this information is clustered in order to identify topology similarity patterns among the edge-creation events. The densities of the identified clusters are then used to compute the scores associated with the unconnected vertex pairs. The results obtained in a preliminary experiment involving a co-authorship network widely used in link prediction studies were promising, providing evidence of the adequacy of the proposed method.

2. PROPOSED METHOD

Let a complex network be represented by a graph G(V, E), let t0 be a time instant, let C be a data clustering algorithm, and let D = {di | di : V × V → R}, i = 1, ..., n, be a set of similarity metrics. Each edge e ∈ E has an attribute e.t containing the moment at which e was inserted into the network. The proposed method, called HE, has the following stages⁴:
—Stage I) Network partitioning - This stage is responsible for splitting G(V, E) into two subgraphs, GTrain(V, EOld) and GTest(V, ENew), where EOld (resp. ENew) contains every edge e such that e.t ≤ t0 (resp. e.t > t0).
—Stage II) Recovery of the network evolution history - For each edge e ∈ EOld, this stage identifies the subgraph of GTrain, denoted GTrain'(V, EOld'), such that ∀a ∈ EOld, a.t ≤ e.t → a ∈ EOld'. GTrain' contains all edges inserted into the network G up to the instant e.t; in other words, GTrain' contains the structure that G had at the exact moment edge e appeared. Then, the vertices u and v connected by e are retrieved and each of the similarity metrics of D is applied, forming a data record with the structure (d1(u, v), d2(u, v), ..., dn(u, v)). This record summarizes the graph topology at the moment the link between u and v appeared, thereby characterizing the creation event of edge e. At the end of this stage, a dataset H(d1, d2, ..., dn), called the network evolution history, is produced. It contains the data records about the topology of G that portray all the moments at which edges appeared, from the creation of the network up to instant t0.
—Stage III) Clustering of the network evolution history - In this stage, the data clustering algorithm C is applied to H(d1, d2, ..., dn), generating a grouping with k clusters {s1, s2, ..., sk}. Each cluster sj delimits a region of the hyperspace characterized by the appearance of edges during the evolution of G. The denser and more populous this region, the more representative the corresponding cluster and the greater the expectation that new edges appearing in the graph can be associated with it.

³ As far as we could observe, the method proposed by [Soares and Prudêncio 2013] is the only exception among those cited. It extracts historical information about what the graph topology looked like at moments chosen by the user. Thus, neither the completeness nor the representativeness of the extracted information can be guaranteed by the method.
⁴ HE is an adaptation of the procedure usually adopted by unsupervised link prediction methods, summarized in Section 1. Stages II and III do not exist in the usual procedure. Stage IV is part of the usual procedure, but it is not restricted to the application of a single similarity metric.


Accordingly, the densities of all clusters are computed and stored to support the following stages in identifying the appearance of new edges.
—Stage IV) Computing the interconnection potential of vertices - For each pair of unconnected vertices u and v of GTrain, the similarity metrics of D are computed, forming a data record with the structure p = (d1(u, v), d2(u, v), ..., dn(u, v)). The record p represents the pair of vertices u and v, summarizing the graph topology with respect to them at instant t0. Then, the cluster to which p should be assigned is determined: the distance from p to the centroid of each cluster is computed, and p is assigned to the cluster s whose centroid is closest, i.e., the cluster that p most resembles. Next, Score(u, v) is computed by Eq. 1, where Dens(s) is the density of cluster s normalized by the largest density among all clusters, Centro(s) is the centroid of cluster s, and Cos(p, Centro(s)) is the cosine similarity between p and Centro(s). Unconnected vertex pairs assigned to denser and more populous clusters should receive higher scores, since they lie in regions with a higher concentration of historical edge-creation records. Additionally, the score of each pair is increased according to the similarity of that pair to the center of mass of its cluster. Once the scores of all unconnected vertex pairs in GTrain have been computed, the pairs are listed in decreasing order of score (a rough code sketch of Stages II–IV is given right after this list of stages).
$$Score(u, v) = Dens(s) + Cos(p, Centro(s)) \quad (1)$$

—Stage V) Performance evaluation - From the ordered list of vertex pairs produced in the previous stage, the proposed method selects the first N elements (the top-N) in order to verify how many of them did in fact become connected in GTest. From the number of hits, the improvement factor of the proposed method with respect to the random predictor (PR)⁵ is computed.
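The sketch below is a rough, illustrative rendering of Stages II–IV, assuming the networkx and scikit-learn libraries; only the CN and JC metrics are computed, cluster population is used as a simple stand-in for the density term Dens(s), and all function names are ours rather than part of the original method.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def pair_metrics(G, u, v):
    # Similarity metrics for one vertex pair on the current snapshot (CN and JC only).
    cn = len(list(nx.common_neighbors(G, u, v)))
    jc = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    return [cn, jc]

def build_history(edges_old):
    # Stage II: replay the edges in chronological order and record the topology
    # (as metric values) seen at the moment each edge appeared.
    snapshot, history = nx.Graph(), []
    for u, v, t in sorted(edges_old, key=lambda e: e[2]):
        snapshot.add_nodes_from([u, v])
        history.append(pair_metrics(snapshot, u, v))
        snapshot.add_edge(u, v, t=t)
    return snapshot, np.array(history, dtype=float)

def rank_pairs(G_train, history, k=3):
    # Stage III: cluster the evolution history.
    km = KMeans(n_clusters=k, n_init=10).fit(history)
    dens = np.bincount(km.labels_, minlength=k).astype(float)
    dens /= dens.max()                       # normalized by the largest cluster
    # Stage IV: score every unconnected pair (Eq. 1).
    scores = {}
    for u, v in nx.non_edges(G_train):
        p = np.array(pair_metrics(G_train, u, v), dtype=float).reshape(1, -1)
        s = int(km.predict(p)[0])            # cluster with the nearest centroid
        cos = cosine_similarity(p, km.cluster_centers_[s].reshape(1, -1))[0, 0]
        scores[(u, v)] = dens[s] + cos
    return sorted(scores.items(), key=lambda kv: -kv[1])   # decreasing score
```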

3. EXPERIMENT AND PRELIMINARY RESULTS

In order to evaluate the proposed method, an initial experiment was carried out with a co-authorship network, gr-qc. Represented by a graph containing information about authors (vertices) and publications (edges) from the general relativity and quantum cosmology section of arXiv, this network has been used in experiments of several link prediction studies [Liben-Nowell and Kleinberg 2007], [Muniz et al. 2017], and [Soares and Prudêncio 2013]. Besides this characteristic, gr-qc was chosen because it is a relatively small network (5,167 authors and 25,526 publications between 1992 and 2002) and is therefore suitable for preliminary tests. The experiment considered three distinct evaluation scenarios. In each scenario, the network was partitioned into a training interval and a test interval, as indicated in Table 1. The similarity metrics used in the three scenarios were Preferential Attachment (PA), Common Neighbors (CN), and Jaccard Coefficient (JC); they were chosen basically because of their popularity in link prediction research. Both the proposed method (HE) and the usual procedure of the unsupervised approach (called here the classical method)⁶ were applied with these metrics. To distinguish each method instance (a method configured to operate with one or more similarity metrics), the following notation was adopted for the proposed method: HE(d1,...,dn), indicating that HE was applied over the evolution history H(d1,...,dn) as defined in Section 2. For the classical method, the notation Md was used, where d ∈ D.⁷

⁵ The performance of PR can be estimated, in a simplified way, as |ENew| / (|V × V| − |EOld|). The improvement factor of a method with respect to PR expresses how many times more precise that method was than PR in predicting new links [Liben-Nowell and Kleinberg 2007].
⁶ A brief description of this procedure is given in Section 1.
⁷ It should be noted that the classical method assumes the use of a single similarity metric. To reflect this restriction, each instance of M can only be configured with one metric d.
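As a hedged illustration of footnote 5, the snippet below computes the improvement factor as the ratio between a method's top-N precision and the estimated precision of the random predictor; the function name and arguments are ours.

```python
def improvement_factor(hits, top_n, n_vertices, n_edges_old, n_edges_new):
    """How many times more precise a method is than the random predictor (PR)."""
    method_precision = hits / top_n
    pr_precision = n_edges_new / (n_vertices * n_vertices - n_edges_old)
    return method_precision / pr_precision
```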


For the data clustering of Stage III of the proposed method, k-Means was adopted. This choice was due basically to the simplicity and availability of implementations of this algorithm. Table 1 consolidates the results of all method instances evaluated in the three scenarios of the experiment. Each value associated with a method instance in a scenario is the improvement factor of that instance's performance with respect to the performance of the random predictor (PR) in that scenario. All evaluated method instances outperformed the random predictor, except those that used the PA similarity metric. It can also be observed that, in all scenarios, the instances of the proposed method that used a single similarity metric to represent the network evolution history were not able to outperform the classical method instances using the corresponding metrics, i.e., the performance of HE(d) was always inferior to that of M(d), for any d ∈ D. Because of the poor performance obtained by the method instances using PA, this metric was discarded and only CN and JC were considered when combining metrics to represent the network evolution history. Thus, only one instance of the proposed method with a combination of metrics was evaluated: HE(CN,JC). This instance was applied in the three scenarios and, in all of them, its performance was superior to that of all other instances, providing evidence that the proposed method can produce good results when configured to consider different similarity metrics to represent the evolution history of a network.

Table 1: Results of the experiment on the gr-qc network in 3 scenarios.

Scenario   GTrain         GTest          PR      HE(JC)   HE(PA)   HE(CN)   HE(CN,JC)   MJC    MPA    MCN
1          [1995, 1997]   [1998, 1998]   0.004   4.10     0.46     2.50     4.78        4.55   0.23   3.64
2          [1992, 1997]   [1998, 1998]   0.006   2.85     0.00     1.55     3.11        2.85   0.26   2.07
3          [1992, 1997]   [1998, 2002]   1.263   1.08     0.13     1.03     1.32        1.08   0.32   1.26

4. FINAL REMARKS

This work presented a link prediction method for complex networks that recovers historical information summarizing what the network topology looked like at the moment each edge was inserted into the graph structure. Unlike existing methods, the proposed method seeks to enrich the context of the problem with information about the kind of event one wishes to predict: the appearance of new edges. The preliminary results obtained in an experiment involving a co-authorship network showed evidence of the adequacy of the proposed method. As future work, we highlight the evaluation of the proposed method on other networks, as well as the analysis of the effect of combining more similarity metrics to describe the edge-creation events.

REFERENCES
Dorogovtsev, S. N. and Mendes, J. F. Evolution of networks. Advances in Physics 51 (4): 1079–1187, 2002.
Liben-Nowell, D. and Kleinberg, J. M. The link-prediction problem for social networks. JASIST 58 (7): 1019–1031, 2007.
Muniz, C. P. M. T., Choren, R., and Goldschmidt, R. R. Using a time based relationship weighting criterion to improve link prediction in social networks. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS. INSTICC, ScitePress, pp. 73–79, 2017.
Soares, P. R. and Prudêncio, R. B. Proximity measures for link prediction based on temporal events. Expert Systems with Applications 40 (16): 6652–6660, 2013.
Wang, P., Xu, B., Wu, Y., and Zhou, X. Link prediction in social networks: the state-of-the-art. SCIENCE CHINA Information Sciences 58 (1): 1–38, 2015.


A Method for Identifying People in Risk Scenarios in Safety-Critical Environments – an Experimental Analysis in Offshore Environments

Felipe Oliveira¹, Flavia Bernardini¹, Marcilene Dianin Viana¹
Instituto de Ciência e Tecnologia, Universidade Federal Fluminense (UFF)
[email protected], {fcbernardini,marcilenedianin}@id.uff.br

Abstract. The Operational Safety Regime, established by Resolution no. 43/2007 of the Brazilian National Petroleum Agency (ANP), publishes annual safety reports based on statistical data concerning the safety performance of operations related to hydrocarbon exploration and production activities. One of the major problems in the offshore environment is related to work accidents caused by falling objects during cargo-handling operations and work at height, so identifying people in risk situations in this kind of environment is very important in this context. However, in these environments cargo handling can occur at any point of a vessel, which makes the installation of motion sensors difficult; Computer Vision can thus be used for this purpose. The goal of this work is to propose a method for identifying people in risk situations using computer vision techniques. Such a tool helps to record and analyze human actions by means of cameras. The results obtained with the proposed method on real data are considered promising.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.2.6 [Artificial Intelligence]: Learning

Keywords: OpenCV, LBP human detector, AdaBoost, work safety

1. INTRODUCTION

In recent decades, highly complex computational systems based on computer vision have been used in several application areas; in medicine, for example, robotic manipulators are used to follow and assist physicians during complex surgical procedures. In particular, every operation carried out in production plants, including offshore environments, involves risks to human beings. Such risks are controlled through good safety practices. Even so, accidents can happen, since the human factor is susceptible to failure. One possibility for identifying people in human-risk scenarios is the use of motion sensors. However, in the offshore environment, cargo handling at height can occur at any point of the vessel, which in addition has very restricted space; both issues make the installation of such sensors difficult. The use of computer vision therefore seems more suitable in this context. The present work aims to present a method for identifying people in risk situations using Computer Vision techniques. A cascade machine learning algorithm was used, with the goal of improving the safety of work that requires the temporary isolation of some areas in offshore environments, by monitoring and analyzing human actions through cameras. We intend to contribute directly to the eradication of


incidents in these environments, through a system capable of identifying people who walk through operational areas temporarily isolated for safety reasons. Due to the challenging scenario that the offshore environment represents and its particularities, the entire database for training the system (positive and negative samples) was obtained in loco, resulting in a database of real images of the offshore area. To evaluate the method, the OpenCV platform was used, which provides the tools for using the machine learning algorithm. This work is organized as follows: Section 2 presents related work that uses techniques or methods for detecting people in several application areas, in order to build the research basis for the addressed problem. Section 3 presents the problem under investigation, based on real data provided by the ANP (Brazilian National Petroleum Agency), some images that show the complexity of the addressed problem, and the proposed method. Section 5 presents the results of the experimental analysis. Section 6 presents the conclusions of the work, including an analysis of its contributions and limitations, and future work.

2. THEORETICAL BACKGROUND

Several techniques for detecting people by means of computational systems have been studied and proposed in the literature [Gouda and Nayak 2015]. Among these works, several authors examine in depth the limitations of certain methods [Moeslund and Granum 2001] in order to improve the performance of the technique under analysis, giving rise to new methods and proposals that make certain people detection algorithms viable. According to [Mehmet and Bulent 2004], illumination influences the thresholding process, since it alters the histogram of the original image, possibly eliminating valley regions between two peaks that could be used to define a threshold. Thresholding consists in identifying, in an image, an intensity threshold at which the object is best distinguished from the background, as shown by [Aboud et al. 2008]. They also describe that the image acquisition step requires the developer to consider the application requirements and to isolate the characteristics of interest of the object, in order to emphasize and record the characteristics found; if the acquired image is not satisfactory, the intended task may not be possible, even with the help of some image enhancement technique. The authors of [Viola and Jones 2001] present a machine learning algorithm for object detection based on the cascade approach (cascade classifier), capable of processing images extremely fast while reaching high detection precision rates. As in any problem whose solution involves machine learning algorithms, there are two main stages: training the classifier and using it for object detection. The technique known as LBP (Local Binary Pattern) has proved to be effective as a powerful local descriptor for image microstructures [Liao et al. 2007]. The LBP operator labels the pixels of an image by thresholding the neighborhood of each pixel against it and considers the result as a binary string or a decimal number.
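To illustrate the basic LBP operator just described, the sketch below computes the label of a single pixel from its 3x3 neighborhood; it is a minimal illustration using NumPy, not the multi-scale block variant of [Liao et al. 2007].

```python
import numpy as np

def lbp_pixel(img, r, c):
    """Basic 3x3 LBP: compare the 8 neighbours with the centre pixel and
    read the resulting bit string as a decimal label."""
    center = img[r, c]
    neighbours = [img[r - 1, c - 1], img[r - 1, c], img[r - 1, c + 1],
                  img[r, c + 1], img[r + 1, c + 1], img[r + 1, c],
                  img[r + 1, c - 1], img[r, c - 1]]
    bits = [1 if n >= center else 0 for n in neighbours]
    return sum(b << i for i, b in enumerate(bits))

img = np.array([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]], dtype=np.uint8)
print(lbp_pixel(img, 1, 1))
```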

3. THE SAFETY PROBLEM IN THE OFFSHORE ENVIRONMENT

The Operational Safety Regime, established by Resolution no. 43/2007 of the Brazilian National Petroleum Agency (ANP), publishes annual safety reports based on statistical data concerning the safety performance of operations related to hydrocarbon exploration and production activities. Each and every operation carried out in production facilities, well interventions, and drilling involves risks arising from the challenging scenario known as offshore. The ANP requires operators to keep these risks under control in order to ensure the use of good safety practices and thus maintain the highest HSE (Health, Safety and Environment) standards.


Every offshore professional receives adequate training and guidance to prevent occurrences and the repetition of accidents. Even with all applicable precautions, incidents are unavoidable, since the human factor is inherently susceptible to failure. Often, such incidents occur when people do not respect basic safety rules, through error or negligence. When a critical activity or operation takes place, warnings are broadcast over the audio system to everyone on board. In addition, the area involved in the work is isolated (through barriers with indicative tape), so that only authorized people circulate in the place at the appropriate time. However, warnings and basic safety rules are often ignored, and the area, even when isolated, remains prone to incidents. Figure 1 illustrates a typical example of an incident-risk situation, in which a worker is performing his tasks inside a clearly isolated area while work at height is taking place in the same place.

Fig. 1. Example of an area isolated with a barrier (yellow tape) on an offshore drilling unit.

Pedestrian detection is a challenging application due to occasional changes that may occur in the environment under analysis. Throughout the day, multiple illumination changes occur, which directly affects image quality; sometimes the image becomes so poor that the information can no longer be recovered. Moreover, variations in human posture, body shape, and clothing are variable characteristics found in the human detection environment, so the object of interest does not always have the same characteristics [Kawade 2015].

4. A METHOD FOR DETECTING PEOPLE IN RISK SCENARIOS

Figure 2 illustrates the steps of the method proposed in this work. In the first step, the images must be collected in loco. This is important so that the real characteristics of the monitored environment, such as illumination and texture, among other aspects, are respected. Next, the images are pre-processed, cropping them to obtain the object of interest for training (a human walking on a vessel), and they are labeled as positive (when the object of interest is present) or negative (when it is not). Features are then extracted to be used as input to the learning algorithm that builds the classifier. It is worth noting that many of the algorithms presented in the literature embed the feature extraction and the construction of so-called "weak" classifiers. From these weak classifiers, it is common to use classifier combination methods, so that the weak classifiers are combined into a strong classifier. The "strong" classifier is then evaluated on a test set that was not used in the training process.


After training and evaluating the classifier, it is used in the detection process. In this stage, people detection must be performed in real time.
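The detection stage can be sketched with OpenCV as below, assuming a trained LBP cascade file and a video source; the file names are hypothetical and the parameter values are only illustrative.

```python
import cv2

# Load the trained cascade (hypothetical path) and scan each frame,
# drawing a rectangle around every detection.
cascade = cv2.CascadeClassifier("cascade_lbp.xml")
cap = cv2.VideoCapture("pipedeck_camera.mp4")  # hypothetical video source

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    for (x, y, w, h) in detections:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imshow("detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```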

Fig. 2. Steps of the method for identifying people in an offshore environment.

To evaluate the proposed method, an open-source computer vision library called OpenCV was used. The critical point of the method lies in the quality of the classifiers built, so this work focuses its evaluation on this step. The algorithm proposed in [Viola and Jones 2001], described in the previous section and distributed with the OpenCV library, is extensively used for face detection. It uses the AdaBoost classifier construction algorithm to build cascade classifiers; the advantage of AdaBoost is that it builds several weak classifiers and combines them into a strong classifier. Methods for training cascade classifiers are available in the OpenCV library. The application available in OpenCV called traincascade supports two types of feature combination methods, Haar [Viola and Jones 2001] and LBP [Liao et al. 2007]. Several studies in the literature indicate that other methods have been successfully applied to pedestrian detection, such as Covariance Regions [Tuzel et al. 2007; Paisitkriangkrai et al. 2008] and Haar wavelets [Viola et al. 2003]. HOG is the feature combination method that stands out and presents the best performance for people detection, since it captures the edge or gradient structure that is characteristic of local shape. However, HOG may present performance limitations when the image background contains noisy edges [Wang et al. 2009], and computing HOG features is computationally heavy. Moreover, support for the HOG feature type was discontinued in the OpenCV cascade training tools as of version 3.0. Therefore, in this work the feature combination method chosen to create the classifiers was LBP (Local Binary Pattern), together with the Gentle AdaBoost algorithm to build fast cascade classifiers, since the classifier training time under these conditions is many times faster than with Haar features [Viola and Jones 2001]. LBP is a computationally efficient texture feature commonly applied to texture classification and face detection. However, it has two limitations for human detection: (i) LBP distinguishes a dark background against a bright human from the opposite case, a distinction that is undesirable in pedestrian detection applications; and (ii) it is invariant to contrast and illumination, so the method cannot tell the difference between a low-contrast local region and a similar region with strong contrast.


To obtain better results in applications involving the LBP feature combination method, the aforementioned limitations can be mitigated by applying some pre-processing techniques before the training phase [Kawade 2015].

Fig. 3. Original image without pre-processing on the left, the grayscale image in the middle, and the equalization of its histogram on the right.

Fig. 4. RGB histogram of the original image without pre-processing.

In this work, to handle contrast variations, all positive samples had their variance normalized in order to reduce contrast variation. Figure 3 shows a positive training sample without variance normalization, followed by the same image in grayscale and, for illustration purposes, the equalization of its histogram. Figure 4 shows the RGB histograms of the original image. Histogram equalization is part of the pre-processing of the images for classifier construction and prediction, since better results are obtained with this pre-processing technique [Kawade 2015]. Figure 5 shows the result of variance normalization on the original image. This feature extraction technique is used by Covariance Regions [Tuzel et al. 2007; Paisitkriangkrai et al. 2008] and Haar wavelets [Viola et al. 2003]. With this method, contrast variations are significantly reduced, which can also be verified by comparing the RGB and grayscale histograms, as shown in Figures 6 and 7.

Fig. 5. Original image with its variance normalized on the left, the grayscale image in the middle, and the equalization of its histogram on the right.
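A minimal sketch of this pre-processing (grayscale conversion, histogram equalization, and variance normalization) follows, assuming OpenCV and a hypothetical sample file path.

```python
import cv2
import numpy as np

def normalize_variance(img_gray):
    """Zero-mean, unit-variance normalization of a grayscale sample,
    rescaled back to 8-bit; reduces contrast variation across samples."""
    img = img_gray.astype(np.float32)
    img = (img - img.mean()) / (img.std() + 1e-8)
    return cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

sample = cv2.imread("positive_sample.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
if sample is not None:
    equalized = cv2.equalizeHist(sample)       # histogram equalization (Fig. 3)
    normalized = normalize_variance(sample)    # variance normalization (Fig. 5)
```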

To build the classifiers, positive and negative samples obtained at the application site (safety-critical offshore environments) were used, which increases the reliability of the system, as stated by Dalal and Triggs (2005).


Fig. 6. RGB histogram of the original image with its variance normalized.

Fig. 7. Grayscale histogram of the original image with its variance normalized.

Several datasets are made available on the web by other researchers. However, these datasets tend to be quite specific to individual problems and may not be appropriate for evaluating the topic of interest of this work if the application presents small differences in shape, for example recognizing the same object at different angles (which eventually changes the object's shape and consequently its features). Thus, to evaluate the proposed method, images specific to the addressed domain were collected.

5. EXPERIMENTAL ANALYSIS

Dataset: Sets of video sequences were collected with a person walking along several distinct paths on a vessel in an offshore environment. These videos were recorded at 30 fps (frames per second) and repeated with different people. Videos were also obtained from different angles and places in the environment to be monitored, also known as the pipedeck.
Training Set: Images extracted from the videos were used for training and testing the classifiers, which means that all collected samples come from the application site. Of the video sequence sets obtained for this work, one was used for training, while the other was used to test the detector by taking random samples from the set. For training, the angular variation of the pedestrian's position that can occur depending on the camera placement during monitoring was considered. Thus, the video sequence whose frames were collected for training established known paths that could provide a position variation as close as possible to the possible poses of a worker in the monitored area. These paths are illustrated in Figure 8 (a), and some samples used in training are illustrated in Figure 8 (b). A total of 1167 positive samples and 5855 negative samples were obtained for training. The positive samples were taken from the frames of the video sequence designated for training, at a resolution of 1280x720; the region of interest was cropped to 120x160 and variance normalization was applied in order to reduce contrast variation. The negative samples (images not containing people) were collected at the same place, at the same 1280x720 resolution, and were not resized. Each cascade classifier was trained using 990 positive images of size 20x26 (to keep the same image aspect ratio),


Fig. 8. (a) Illustration of the positions used for training and possible inclinations. (b) Small sample of positive images used in training.

which corresponds to 85% of the total positive samples. This was done to ensure the training of a good classifier and to allow the training to finish all the specified stages without the risk of overfitting. Up to 20 training stages were defined, which generated a maximum of 185 "weak" classifiers, combined using the Gentle AdaBoost algorithm, a more robust version of Real AdaBoost, also used in [Viola and Jones 2001]. The weak classifiers using Haar [Viola and Jones 2001] and LBP [Liao et al. 2007] features were built on a machine with a Core i5-4300U 2.50 GHz processor and 8.00 GB of RAM. Training each classifier took around 30 minutes for the LBP type and 45 minutes for the Haar type.
Results: In total, 400 representative random samples that were not used in training (200 with at least one worker in the image and 200 images of external areas of offshore vessels without any worker) were selected for the test set. The performance of the classifiers was evaluated for 15 to 20 stages for two different classifier types, Haar and LBP, both implemented in the OpenCV library. Classifiers with a number of stages outside this range presented unsatisfactory results, even after adjusting their parameters. Table I shows the results obtained with Haar and LBP only for 20 and 16 stages, since these were the results with the best accuracy. The metrics shown are Accuracy, Precision, True Positive Rate (TPR), False Positive Rate (FPR), and the execution time to build all the weak classifiers. It is worth noting that the 16-stage LBP classifier, after adjustments to its execution parameters, presented the best precision and the lowest false positive rate (FPR). The other classifiers built, omitted from this table, presented inferior performance even after parameter adjustments.

Table I. Results obtained using Haar and LBP.

Technique                                Accuracy   Precision   TPR      FPR      Execution Time
HAAR (20 stages + default parameters)    55.75%     56.91%      35.00%   25.72%   16.762s
LBP (20 stages + default parameters)     69.25%     43.86%      84.00%   66.34%   20.220s
HAAR (16 stages + default parameters)    43.00%     21.39%      0.60%    89.45%   15.103s
LBP (16 stages + default parameters)     50.25%     3.55%       1.00%    99.94%   20.541s
HAAR (16 stages + parameter tuning)      46.00%     7.06%       90.50%   99.87%   14.605s
LBP (16 stages + parameter tuning)       84.25%     89.02%      77.00%   9.40%    12.483s

6. CONCLUSIONS AND FUTURE WORK

Detecting people in risk scenarios using computer vision is an important application in several areas involving safety. In the offshore environment, this issue is even more important,


since rescuing victims is more complex for logistical reasons and the environmental hazard is much greater. The goal of this work was to present a method based on computer vision for identifying people in risk situations. Real images from an offshore oil platform environment were used. The critical issue of the method lies in the construction of the classifier to be used, which was initially evaluated in this work. Several feature extraction algorithms were found in the literature, and two were evaluated, Haar and LBP. A boosting algorithm was used to combine the weak classifiers. Several parameters were tested, and a specific combination of parameters for the LBP algorithm and the boosting algorithm obtained the best results: the best detection configuration on the test set achieved a precision of 89.02%, an accuracy of 84.25%, and a false positive rate of 9.4%. It is worth noting that, in the context of this work, priority is given to a lower number of false-positive alarms, since the system must notify at any moment whether or not a person is inside a risk area, especially if someone enters a cargo-handling area without authorization, for example. The results were considered good: in the domain of detecting people in risk scenarios, the people detection technique proved to be a promising alternative to reduce risk in zones of imminent danger. Nevertheless, the present work still has some limitations and opportunities for improvement. The literature recommends evaluating the performance of detection classifiers using the ROC curve; we intend to apply this technique to extend the experimental part of the work and thus give more credibility to the results. We also intend to evaluate, in the future, the use of techniques to reduce the imbalance of the dataset and to minimize the incidence of false positives. Another important issue in the addressed scenario is the use of the proposed method for real-time classification, in which the identification of people often happens in a region defined by a (convex or non-convex) polygon in the monitored area of interest and must be performed in the shortest possible time. Such functional and non-functional requirements of the complete solution will be addressed in future work.

REFERENCES
Aboud, S. R., Dutra, L. V., and Erthal, G. J. Limiarização automática em histogramas multimodais. In Proc. 7th Brazilian Conf. Dynamics Control and Applications, 2008.
Gouda, R. and Nayak, J. S. Survey on pedestrian detection, classification and tracking. Int. J. Comp. Science and Info. Tech., 2015.
Kawade, R. B. A review on pedestrian detection techniques. Int. J. Science and Research, 2015.
Liao, S., Zhu, X., Lei, Z., Zhang, L., and Li, S. Z. Learning multi-scale block local binary patterns for face recognition. In Proc. Int. Conf. Biometrics. pp. 828–837, 2007.
Mehmet, S. and Bulent, S. Survey over image thresholding techniques and quantitative performance evaluation. J. Electronic Imaging 13 (1), 2004.
Moeslund, T. B. and Granum, E. A survey of computer vision-based human motion capture. Comput. Vision Image Understanding vol. 81, pp. 231–268, 2001.
Paisitkriangkrai, S., Shen, C., and Zhang, J. Fast pedestrian detection using a cascade of boosted covariance features. IEEE Trans. Circuits Syst. Video Technol. vol. 18, 2008.
Tuzel, O., Porikli, F., and Meer, P. Human detection via classification on Riemannian manifolds. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. pp. 1–8, 2007.
Viola, P. and Jones, M. J. Rapid object detection using a boosted cascade of simple features. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR). pp. 511–518, 2001.
Viola, P., Jones, M. J., and Snow, D. Detecting pedestrians using patterns of motion and appearance. In Proc. IEEE Int. Conf. Computer Vision. pp. 734–741, 2003.
Wang, X., Han, T. X., and Yan, S. An HOG–LBP human detector with partial occlusion handling. In Proc. IEEE Int. Conf. Comp. Vis., 2009.


Acoplamento para resolução de correferência em ambiente de aprendizado sem-fim
F. Quécole¹, M.C. Duarte², E.R. Hruschka¹
¹ Universidade Federal de São Carlos, Brazil. [email protected], [email protected]
² Univ. Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516, Saint-Étienne, France. [email protected]

Abstract. One of the challenges for never-ending language learning (NELL) systems is to properly identify different noun phrases that denote the same concept in order to maintain the cohesion of their knowledge base. This paper investigates coupling as an approach to improve coreference resolution in NELL. The obtained results suggest that coupling is a useful approach to achieve better coverage and accuracy in NELL's knowledge base.
Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning
Keywords: coreference, coupling, machine learning, never-ending learning

1. INTRODUÇÃO

Este trabalho é vinculado aos estudos de leitura da web de forma sem-fim, também chamados de Never-Ending Language Learner (NELL) [Mitchell et al. 2015]. NELL é um sistema de aprendizado sem-fim, o qual realiza o aprendizado a partir da extração de fatos da web. NELL está em execução desde 12 de janeiro de 2010 e tem como sede a Universidade Carnegie Mellon (Estados Unidos). NELL objetiva aprender novos fatos, popular e expandir a própria base de conhecimento a partir da web, 24 horas por dia, 7 dias por semana, para sempre. Tendo por base esse contexto, este artigo está voltado ao tratamento de correferências em ambiente de aprendizado sem-fim. Correferências são diferentes entidades nomeadas que denotam a mesma semântica, por exemplo: "Shaquille O'Neal" e "Shaq". Tal problema é comum em sistemas de extração de informações, como a NELL [Carlson et al. 2010] e o TextRunner [Yates et al. 2007], os quais falham em identificar entidades correferentes. Tal falha impede que o sistema interprete diferentes entidades com o mesmo significado. Com isso, a base de conhecimento pode armazenar entidades como São Paulo, Sampa e Cidade da Garoa como diferentes instâncias, reduzindo assim a precisão da base de conhecimento e impedindo que uma maior quantidade de fatos seja extraída utilizando todas as três entidades como correferentes. Correferências, no contexto da NELL, podem ser analisadas a partir de aspectos morfológicos e/ou semânticos das entidades nomeadas. Morfologia, no contexto da linguística, bem como deste artigo, investiga o formato das palavras isoladamente (neste artigo, tais palavras referem-se às entidades nomeadas), isto é, sem analisar o contexto no qual estão inseridas. Semântica refere-se à investigação das relações da base de conhecimento em que as entidades nomeadas estão inseridas. Atualmente a NELL possui um componente chamado ConceptResolver [Krishnamurthy and Mitchell 2011], o qual atua na resolução de correferências com base em relações. Para exemplificar melhor, considere a Tabela I.



Tabela I. Instâncias de Relações como atributos semânticos

Relação / EN1        | Shaquille O'Neal | Shaq O'Neal | Shaq        | Stephen Curry
athleteAlsoKnownAs   | Shaq O'Neal      | Shaq        | Shaq O'Neal | -
athleteLedSportsTeam | Suns             | Suns        | Suns        | Golden State Warriors
athletePlaysForTeam  | Suns             | Suns        | Suns        | Golden State Warriors
athletePlaysInLeague | NBA              | NBA         | NBA         | NBA
athletePlaysSport    | Basketball       | Basketball  | Basketball  | Basketball

A Tabela I apresenta possíveis instâncias para entidades nomeadas (EN) (colunas), dada a relação (linhas). Por exemplo: para a relação athletePlaysSport, para a entidade nomeada (en1) “Shaquille O’Neal”, apresenta o par “Basketball”, O “-” indica que a entidade nomeada em questão não tem participação naquela relação. As ENs, no contexto da NELL, são instâncias de pessoas, cidades, objetos, eventos, datas, etc. Este artigo é organizado como se segue. Na seção 2 apresenta-se os principais trabalhos correlatos no contexto deste artigo; na seção 3 é investigada a resolução de correferências através de atributos morfológicos, semânticos e em conjunto (com ou sem acoplamento); na seção 4 o resultados são apresentados e discutidos; a seção 5 contém as conclusões e trabalhos futuros 2. NEVER-ENDING LANGUAGE LEARNING A NELL é um sistema que visa aprender ininterruptamente a partir da web. A ideia é que a NELL seja capaz de, a cada dia (ou iteração), aprender mais e melhor. Uma vez que, a cada iteração, o sistema promove alguns dos novos fatos aprendidos (os de maior confiança) e os utiliza como rótulos para a próxima iteração. Para que o aprendizado seja realizado, a NELL possui como entrada uma ontologia, a qual é organizada em categorias, relações e sementes (exemplos) de ambas. As categorias são os tipos de conhecimentos a serem aprendidos, como: cidade(x), país(y), estado(w), pessoa(z), etc., enquanto as relações são os relacionamentos entre as categorias, como: localizadoEm(cidade(x),(país(y)), nasceuEm(pessoa(z),cidade(x), capitalDe(cidade(x),país(y), etc. Para categorias, por exemplo, as sementes poderiam ter como valores de x: São Carlos, Ribeirão Preto, Paris, etc. enquanto que para relações poderiam ter como sementes de x e y respectivamente: São Carlos & Brasil, Ribeirão Preto & Brasil, Paris & França, etc., e assim por diante para as demais categorias e relações. A Figura 1 ilustra uma possível parte da ontologia.

Fig. 1. Exemplo de um subconjunto da ontologia [Duarte and Hruschka 2014]
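A estrutura de entrada descrita acima (categorias, relações e sementes) poderia ser representada, de forma simplificada e meramente ilustrativa (exemplo hipotético nosso, não extraído da ontologia real da NELL), como no esboço em Python a seguir:

ontologia = {
    "categorias": {
        # sementes (exemplos) de cada categoria
        "cidade": ["São Carlos", "Ribeirão Preto", "Paris"],
        "pais": ["Brasil", "França"],
    },
    "relacoes": {
        # relação: (categoria do primeiro argumento, categoria do segundo argumento)
        "localizadoEm": ("cidade", "pais"),
        "capitalDe": ("cidade", "pais"),
    },
    "sementes_de_relacoes": {
        "localizadoEm": [("São Carlos", "Brasil"), ("Paris", "França")],
    },
}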


A NELL é baseada em aprendizado semissupervisionado [Zhu 2010] , estando assim suscetível ao semantic-shift [Curran et al. 2007], isto é, conforme novos fatos são aprendidos e incorporados a base de conhecimento (e assim utilizados como dados rotulados em outras iterações), podem ocorrer pequenos desvios semânticos que, se não tratados adequadamente, podem acumular e prejudicar a aprendizagem do sistema. Tais desvios podem ser erros ou alguma mudança de conceito (ex: presidente atual do Brasil). Para que o semantic-shift seja minimizado, a NELL utiliza vários componentes de forma acoplada como apresentado em [Carlson et al. 2009], o que significa que cada um dos componentes realiza o aprendizado a partir de uma visão diferente, e ao final ambos os resultados são combinados para a obtenção de uma decisão mais precisa. Alguns dos principais componentes da NELL são: CML (Coupled Morphologic Learner) [Carlson et al. 2009] responsável pela identificação de entidades nomeadas a partir de análise morfológica; CPL (Coupled Patterns Learner) [Carlson et al. 2010] responsável pela extração de entidades nomeadas a partir de padrões textuais e vice-versa; SEAL (Coupled Set Expander for Any Language) [Carlson et al. 2009] atua como o CPL, porém a partir de padrões HTML; ConceptResolver [Krishnamurthy and Mitchell 2011] responsável pela resolução de correferência no aprendizado de relações; ConceptResolver, principal foco deste artigo, realiza clusterização com base nas similaridades de características semânticas, e assim decide se um par de entidades nomeadas são correferentes; entre outros. 3. INVESTIGAÇÃO DA RESOLUÇÃO DE CORREFERÊNCIAS O atual componente da NELL de resolução de correferência, o ConceptResolver, atua de forma a agrupar entidades nomeadas candidatas à correferência de acordo com as suas ocorrências nas relações da NELL. Por isso, o ConceptResolver consegue encontrar entidades nomeadas, obtendo bons resultados, com base em características semânticas, porém possui dificuldades para quando encontra relações com valores faltantes, como apresentado na Tabela II.

Tabela II. Instâncias de Relações com atributos faltantes

Relação / EN1        | Shaquille O'Neal           | Messi     | Cristiano Ronaldo | Stephen Curry
athleteHomeStadium   | Talking Stick Resort Arena | -         | Santiago Bernabeu | -
athletePlaysForTeam  | -                          | Barcelona | Real Madrid       | Warriors
belongsTo            | -                          | Argentina | Portugal          | -
athletePlaysInLeague | NBA                        | La Liga   | -                 | NBA
athletePlaysSport    | Basketball                 | Soccer    | Soccer            | Basketball
hasSpouse            | -                          | -         | Irina Shayk       | -

O tratamento de correferências em sistemas de aprendizado sem-fim, como a NELL, é interessante. Já que o negligenciamento dessas ocorrências poderia causar um problema de representação do conhecimento extraído. Exemplo: o sistema identificaria que “Shaquile O’Neal” é um atleta, bem como que “Shaq” é um atleta, que ambos jogaram basquete, e lideraram o time “Phoenix Suns”, mas não que referenciam a mesma entidade da vida real. O não tratamento dessa equivalência pode fazer com que o sistema aprenda, por exemplo, que “Shaquille O’Neal” jogou na NBA, mas não aprenda que “Shaq” também jogou, mesmo ambos sendo termos que referenciam a mesma entidade. Ou seja, mesmo a informação estando contida no sistema, ela não é bem referenciada, ocasionando falhas que poderiam ser sanadas pelo tratamento de correferências. Em [Duarte and Hruschka 2014] foi apresentada uma discussão sobre as características semânticas (baseadas em relações) utilizadas pelo ConceptResolver e como o resultado poderia ser melhorado com a adição de características morfológicas (baseadas em categorias). Ainda em [Duarte and Hruschka 2014] concluiu-se que quando ambas características, semânticas e morfológicas, são analisadas juntas melhores resultados são adquiridos, podendo assim impactar positivamente na precisão e cobertura da NELL. Além disso, apontou-se também que a independência entre ambas as abordagens deveria ser investigada como mais uma forma de melhoria na resolução de correferência. Para isso o acoplamento Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


entre ambas características foi apontado como uma possível solução, a qual é a abordagem investigada neste artigo, a partir uso de classificadores ensemble. A incompletude na base nas relações pode ocasionar problemas com a abordagem atual do sistema, que considera apenas as relações como atributos para o cálculo de correferências. Por isso, a utilização de atributos morfológicos contribuem na melhoria da resolução de correferências, minimizando esse problema. Em [Duarte and Hruschka 2014] foram propostos os seguintes atributos morfológicos: (1) Similaridade de strings: número real contido no intervalo fechado [0,1]. Quanto mais próximo de 1, maior a similaridade entre en1 e en2; (2) Palavras diferentes: atributo binário, tem valor = 1 caso en1 tenha uma quantidade de palavras diferentes comparado com en2, senão, assume 0; (3) Subconjunto: atributo binário, 1 caso en1 esteja contido em en2 ou vice-versa, e 0 caso contrário. Por exemplo Subconjunto(Shaquille O’Neal, Shaq) = 1, e Subconjunto(Shaq, Curry)=0; (4) Acrônimo: atributo binário. 1 caso um dos parâmetros for acrônimo do outro parâmetro, 0 caso contrário. Por exemplo: Acronimo(SP, Sao Paulo)=1, e Acronimo(RJ, Shaquille O’Neal)=0; (5) Similaridade de strings de Covington: valor real contido no intervalo de 0 a 1, calculado aplicando o algoritmo de Covington [Covington 1996]; (6) Proximidade: valor real no intervalo de 0 a 1, obtido aplicando o algoritmo baseado em JaroWinkler. [Carpenter 2007]. Além disso foram usados os seguintes atributos, os quais referem-se, de forma simulada, à forma de processamento utilizada pelo ConceptResolver: (7) Número de relações que compartilham a mesma entidade nomeada (RelacoesCompartilhadas): valor inteiro resultante da soma de relações contidas na base conhecimento da NELL que compartilham en1 e en2. Utilizando a Tabela 1 como exemplo, RelacoesCompartilhadas(Shaquille O’Neal, Shaq) = 5, e RelacoesCompartilhadas(Shaq, Stephen Curry) = 4; (8) Proporção de relações que compartilham a mesma instância (PropRelacoesCompartilhadas): valor inteiro, é a proporção entre relações que tenham o mesmo par para en1 e en2 e todas as relações que tenham en1 e en2. Utilizando como exemplo a Tabela 1, tem-se PropRelacoesCompartilhadas(Shaquille O’Neal, Shaq) = 5/5 = 1, e PropRelacoesCompartilhadas(Shaq, Stephen Curry) = 2/4 = 0.5. Para o presente artigo foram reproduzidos e recalculados todos os 8 atributos citados. As características morfológicas foram adaptadas do trabalho correlato apresentado em [Duarte and Hruschka 2014]. As características semânticas foram adaptadas a partir da simulação do ConceptResolver [Krishnamurthy and Mitchell 2011], o qual é o componente responsável pela resolução de correferência da NELL. Uma vez selecionados os atributos a serem utilizados, foi necessária a criação de uma base de dados para testá-los. Para isso, foi implementada um método para extração, que busca por candidatos a correferência em algumas relações específicas. A saída deste método passou por uma seleção manual, compondo o conjunto de dados para realização dos experimentos (base para treinamento e teste). Nesse artigo, foram utilizadas as relações personAlsoKnownAs (pessoa também conhecida como), athleteAlsoKnownAs(atleta também conhecido como) e cityAlsoKnownAs (cidade também conhecida como) e suas instâncias contidas na base de conhecimento da NELL para a mineração de candidatos. Com esse método, foram obtidos 200 pares (100 corretos e 100 incorretos) de entidades nomeadas, candidatas a correferência. 
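Apenas para ilustrar como alguns desses atributos podem ser calculados, segue um esboço mínimo em Python (suposição nossa, com nomes de funções hipotéticos; não se trata do código usado pelos autores):

def subconjunto(en1, en2):
    # 1 se uma entidade nomeada está contida na outra, 0 caso contrário
    a, b = en1.lower(), en2.lower()
    return int(a in b or b in a)

def acronimo(en1, en2):
    # 1 se uma das entidades é acrônimo da outra (ex.: "SP" e "Sao Paulo")
    def eh_acronimo(curta, longa):
        iniciais = "".join(p[0] for p in longa.split())
        return curta.replace(".", "").lower() == iniciais.lower()
    return int(eh_acronimo(en1, en2) or eh_acronimo(en2, en1))

def prop_relacoes_compartilhadas(rel_en1, rel_en2):
    # rel_en1 e rel_en2: dicionários {relação: valor} extraídos da base da NELL
    compartilhadas = set(rel_en1) & set(rel_en2)
    if not compartilhadas:
        return 0.0
    iguais = sum(1 for r in compartilhadas if rel_en1[r] == rel_en2[r])
    return iguais / len(compartilhadas)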
Após a obtenção das características a serem investigadas e dos dados de teste, foram utilizados alguns classificadores, com o objetivo de avaliar o quão significativas as características selecionadas são. Foi utilizado o módulo em Python Scikit-learn [Pedregosa et al. 2011], com os seguintes classificadores: Gaussian Naive Bayes, K-Nearest Neighbors (KNN), Árvore de Decisão, Random Forest, Logistic Regression e Support Vector Classification.
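Um esboço mínimo (suposição nossa, não o código original) da avaliação desses classificadores com validação cruzada no Scikit-learn, assumindo uma matriz X com os 8 atributos dos 200 pares e um vetor y com 1 para pares correferentes e 0 caso contrário:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

classificadores = {
    "Gaussian Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Arvore de Decisao": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Support Vector Classification": SVC(),
}

def avalia_classificadores(X, y, cv=10):
    # F-Measure média de cada classificador, estimada por validação cruzada
    return {nome: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
            for nome, clf in classificadores.items()}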


Para investigar a melhora de resultados com a abordagem que explora as duas visões, foram reproduzidos os experimentos contidos em [Duarte and Hruschka 2014], utilizando a base citada acima, obtida para este artigo, os quais são:

— Experimento 1 (exp1): considera características morfológicas e semânticas, isto é, utiliza todas as 8 características citadas acima.
— Experimento 2 (exp2): considera apenas as características morfológicas, ou seja, apenas as 6 primeiras citadas acima.
— Experimento 3 (exp3): considera apenas as características semânticas, ou seja, as 2 últimas citadas acima.

Com o objetivo de verificar a obtenção de melhorias através de uma análise da F-Measure, aplicando-se um acoplamento de características morfológicas e semânticas, foi implementado um ensemble (em português, "método de grupo") [Dietterich 2002], que consiste em uma composição de classificadores independentes combinados a partir de um processo de votação (voting classifier). O ensemble apresentado neste trabalho foi implementado utilizando o módulo em Python Scikit-learn [Pedregosa et al. 2011] e consiste no agrupamento de quatro classificadores, aplicados separadamente aos dados morfológicos e semânticos, resultando em um total de oito classificadores. Os classificadores escolhidos foram: K-Nearest Neighbors (KNN), Random Forest, Árvore de Decisão e Gaussian Naive-Bayes (GNB). Referentes ao ensemble de classificadores, foram realizados mais dois experimentos:

— Experimento 4 (exp4): ensemble composto pelos três classificadores de melhor desempenho nos experimentos anteriores (os 3 primeiros citados acima);
— Experimento 5 (exp5): ensemble composto pelos classificadores do experimento 4 acrescidos do Gaussian Naive-Bayes.

4. DISCUSSÃO DE RESULTADOS

Os resultados obtidos reforçam as conclusões discutidas em [Duarte and Hruschka 2014] e, além disso, apresentam o comportamento da abordagem de acoplamento utilizando-se ensemble de classificadores, a qual obteve melhores resultados que os apresentados pelos autores citados. A Tabela III apresenta os resultados para cada classificador, levando em conta os experimentos 1, 2 e 3, e permite identificar qual dos experimentos obteve melhor desempenho em termos de F-Measure (também conhecida como F-Score). Na Tabela III, referente à reprodução dos experimentos apresentados em [Duarte and Hruschka 2014], porém utilizando a nova base extraída neste artigo, em todos os classificadores foram obtidos melhores resultados no uso conjunto de características morfológicas e semânticas (exp1). Os classificadores que utilizaram apenas características semânticas (exp3) apresentaram os piores resultados, causando o aparecimento de muitos falsos positivos, enquanto classificadores apenas-morfológicos (exp2) tenderam a apontar menos falsos positivos e mais falsos negativos. Tal comportamento reforça a ideia de que as duas abordagens apresentam erros independentes. Diferentemente dos resultados apresentados em [Duarte and Hruschka 2014], em que classificadores apenas semânticos apresentaram melhores resultados em detrimento de classificadores apenas morfológicos, neste artigo os resultados foram opostos. Atribui-se tal diferença ao fato de ambos os experimentos terem sido feitos com bases pequenas.
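Retomando o ensemble por votação descrito acima, segue um esboço ilustrativo em Python (suposição nossa; o texto não traz a implementação original dos autores) do acoplamento das duas visões, assumindo rótulos binários (1 = correferente) e que X_morf e X_sem contêm, respectivamente, os atributos morfológicos e semânticos:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

def ensemble_acoplado(X_morf, X_sem, y, k=10):
    bases = [KNeighborsClassifier(), RandomForestClassifier(),
             DecisionTreeClassifier(), GaussianNB()]
    votos = []
    for X in (X_morf, X_sem):  # uma cópia de cada classificador base por visão
        for clf in bases:
            # predições obtidas por validação cruzada com k grupos (parâmetro K do texto)
            votos.append(cross_val_predict(clf, X, y, cv=k))
    votos = np.array(votos)    # 8 classificadores x n instâncias
    # votação simples: todos os classificadores têm o mesmo peso
    return (votos.mean(axis=0) >= 0.5).astype(int)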
Neste artigo, a extração foi feita automaticamente, através de uma implementação em Java de uma ferramenta para realização da tarefa, o que possibilita maior facilidade para experimentos e investigações futuras, diferentemente da abordagem de [Duarte and Hruschka 2014], em que os autores extraíram manualmente a base.


Tabela III. Resultados dos classificadores selecionados (matriz de confusão e F-Measure)

Algoritmo                     | Exp. | Sim Certo | Sim Errado | Não Certo | Não Errado | F-Measure
Gaussian Naive Bayes          | exp1 | 59 | 41 | 97 | 3  | 0.707
Gaussian Naive Bayes          | exp2 | 55 | 45 | 99 | 1  | 0.680
Gaussian Naive Bayes          | exp3 | 91 | 9  | 19 | 81 | 0.670
KNN                           | exp1 | 85 | 15 | 81 | 19 | 0.835
KNN                           | exp2 | 76 | 24 | 91 | 9  | 0.817
KNN                           | exp3 | 78 | 22 | 48 | 52 | 0.665
Árvore de Decisão             | exp1 | 79 | 21 | 82 | 18 | 0.821
Árvore de Decisão             | exp2 | 74 | 26 | 87 | 13 | 0.807
Árvore de Decisão             | exp3 | 67 | 33 | 73 | 27 | 0.709
Random Forest                 | exp1 | 84 | 16 | 83 | 17 | 0.831
Random Forest                 | exp2 | 72 | 28 | 83 | 17 | 0.789
Random Forest                 | exp3 | 67 | 33 | 71 | 29 | 0.695
Logistic Regression           | exp1 | 75 | 25 | 85 | 15 | 0.775
Logistic Regression           | exp2 | 68 | 32 | 90 | 10 | 0.753
Logistic Regression           | exp3 | 58 | 42 | 66 | 34 | 0.580
Support Vector Classification | exp1 | 71 | 29 | 93 | 7  | 0.783
Support Vector Classification | exp2 | 65 | 35 | 96 | 4  | 0.755
Support Vector Classification | exp3 | 81 | 19 | 50 | 50 | 0.496

Objetivando-se investigar o uso de acoplamento a partir de nova base extraída da NELL e após a reprodução dos experimentos apresentados em [Duarte and Hruschka 2014], foram realizados dois experimentos (Exp4 e Exp5) com o uso de ensemble de classificadores. Os resultados de tal abordagem são apresentados na Tabela IV, a qual apresenta o uso de três classificadores (exp4) e com o acréscimo do classificador Gaussian Naive-Bayes (exp5), com diferentes valores para K, em que K representa a quantidade de grupos formados na validação cruzada.

Tabela IV. Resultado do Ensemble

Experimento | K  | Sim Certo | Sim Errado | Não Certo | Não Errado | F-Measure
exp4        | 5  | 93 | 7 | 75 | 25 | 0.853
exp4        | 10 | 96 | 4 | 72 | 28 | 0.857
exp4        | 20 | 98 | 2 | 74 | 26 | 0.875
exp5        | 5  | 98 | 2 | 75 | 25 | 0.879
exp5        | 10 | 97 | 3 | 76 | 24 | 0.883
exp5        | 20 | 98 | 2 | 80 | 20 | 0.889

Analisando os resultados obtidos na Tabela IV, pode-se concluir que a abordagem acoplada obteve resultados mais precisos que a abordagem semântica e a morfológica experimentadas somente como um único conjunto de atributos. Os resultados obtiveram melhores valores de F-Measure de acordo o valor de K, ou seja, quanto maior K melhor foi o resultado. Tal comportamento reforça que o conjunto dos atributos usados são informativos e que os modelos gerados obtiveram bom desempenho a partir da base selecionada evitando overfitting. A Tabela V apresenta algumas instâncias avaliadas pelo ensemble, na qual os números referem-se aos classificadores apresentados na seção 3, T e F representam o voto de cada classificador: correferência e não-correferência, respectivamente; Azul significa que a classificação está correta, e vermelho, incorreta. A tabela evidencia a colaboração de cada classificador morfológico e semântico para a resolução das correferências. No instância que relaciona “New York City” e “Big Apple” pode-se notar que apenas a morfologia não seria capaz de apontar a relação entre os termos, demonstrando que os atributos Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


Tabela V. Instâncias avaliadas pelo Ensemble

EN1                      | EN2            | Morfológico 1 2 3 4 | Semântico 1 2 3 4 | Ensemble Class
Canadian Jewish Congress | CJC            | T T T T             | F F F F           | Coreference
New York City            | Big Apple      | F F F F             | T T T T           | Coreference
Ho Chi Minh              | Saigon         | F F F F             | F T T T           | No-coreference
Madrid                   | Rio            | F T T T             | F F F F           | No-coreference
David Lee                | David Robinson | T T T T             | F F F F           | Coreference

semânticos ainda são os mais importantes nesse cálculo. Mas, quando esses falham, como no caso de "Canadian Jewish Congress" e "CJC", é interessante considerar os atributos morfológicos na resolução. Os erros presentes nessa abordagem devem-se ao fato de o ensemble ser implementado pelo processo de votação simples, isto é, os resultados de todos os classificadores têm o mesmo peso, o que possibilitou a propagação de alguns erros, como no caso de "David Lee" e "David Robinson", que são termos morfologicamente semelhantes, mas sem relação. Embora todos os classificadores semânticos apontassem que não existiria relação entre os termos, a instância foi erroneamente classificada como correferência, uma vez que todos os classificadores morfológicos apontavam tal fato. Outro caso derivado do problema acima envolve "Ho Chi Minh" e "Saigon": os termos não são morfologicamente próximos, mas os classificadores semânticos apontavam uma possível relação, exceto por um deles. Com isso, a instância também foi incorretamente classificada, dessa vez como uma não-correferência. A Tabela VI apresenta uma pequena amostra de instâncias avaliadas e suas classificações de acordo com cada método, sendo:

— M: Método apenas-morfológico (exp2);
— S: Método apenas-semântico (exp3);
— MS: Método morfológico e semântico (exp1);
— MSA: Método morfológico e semântico com acoplamento (exp5).

Tabela VI. Exemplos de instâncias para cada abordagem

EN1         | EN2          | Class | M   | S   | MS  | MSA
Sin City    | Vegas        | YES   | NO  | YES | NO  | YES
Los Angeles | LA           | YES   | YES | NO  | NO  | YES
Calcutta    | Kolkata      | YES   | YES | NO  | YES | YES
Santiago    | Campo Grande | NO    | NO  | YES | NO  | NO
Los Banos   | LA           | NO    | YES | NO  | YES | NO

Os resultados apresentados na Tabela VI comprovam que o método acoplado utilizando ensemble de classificadores foi capaz de apresentar melhores resultados. Demonstrou-se ainda que a importância de se considerar atributos morfológicos e semânticos de forma a melhorar os resultados com base nos erros independentes. Portanto, a abordagem proposta neste artigo atingiu bons resultados e poderá ser utilizada como base para estudos futuros quanto à análise de correferência da NELL a partir do acoplamento. 5. CONCLUSÃO E TRABALHOS FUTUROS A partir dos experimentos executados neste artigo, fica evidente que o uso acoplado de características morfológicas e semânticas apresenta melhores resultados na análise de correferências, levando-se em consideração que as duas abordagens apresentam erros independentes. Os métodos consideram aspectos diferentes de cada entidade nomeada e, por isso, apresentam erros independentes. Por exemplo, se apenas aspectos morfológicos forem considerados, pode-se avaliar “conserto” e “concerto” como Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


correferências, sendo que elas não denotam o mesmo sentido. Ou então, considerar “Rei do Pop” e “Michael Jackson” como não sendo correferências. Mas, ao considerar também aspectos semânticos, ficariam claras as relações entre os termos. Isso demonstra a importância dos erros independentes, por proporcionar que uma abordagem “corrija” os pontos fracos da outra, causando uma melhora na precisão. O uso de acoplamento para soluções de outros problemas da NELL já foram apresentados em [Carlson et al. 2009], [Hruschka Jr et al. 2013] e [Krishnamurthy and Mitchell 2011]. Da mesma forma neste corrente artigo, também atingiu-se resultados melhores, assim comprovando que o acoplamento é um bom aliado, também, para a resolução de correferências. Neste artigo, o uso acoplado de características morfológicas e semânticas apresentou melhores resultados dos que os obtidos pelo ConceptResolver [Krishnamurthy and Mitchell 2011], atual componente da NELL. Com isso, este artigo apresenta e propõe um novo método para resolução de correferências da NELL. O método proposto ainda não foi implementado no sistema, mas faz parte dos trabalhos futuros. E por fim, como continuidade deste trabalho, propõe-se investigar e implementar um novo método para a classificação a partir dos resultados e erros independentes dos classificadores base. REFERÊNCIAS Carlson, A., Betteridge, J., Hruschka, Jr., E. R., and Mitchell, T. M. Coupling semi-supervised learning of categories and relations. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. SemiSupLearn ’09. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1–9, 2009. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka, Jr., E. R., and Mitchell, T. M. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. AAAI’10. AAAI Press, Atlanta, Georgia, pp. 1306–1313, 2010. Carpenter, B. Lingpipe for 99.99% recall of gene mentions. In Proceedings of the Second BioCreative Challenge Evaluation Workshop. Vol. 23. -, -, pp. 307–309, 2007. Covington, M. A. An algorithm to align words for historical comparison. Computational linguistics 22 (4): 481–496, 1996. Curran, J. R., Murphy, T., and Scholz, B. Minimising semantic drift with mutual exclusion bootstrapping. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics. Vol. 6. pp. 172–180, 2007. Dietterich, T. G. Ensemble learning. The handbook of brain theory and neural networks vol. 2, pp. 110–125, 2002. Duarte, M. C. and Hruschka, E. R. Exploring two views of coreference resolution in a never-ending learning system. In Hybrid Intelligent Systems (HIS), 2014 14th International Conference on. IEEE, pp. 273–278, 2014. Hruschka Jr, E. R., Duarte, M. C., and Nicoletti, M. C. Coupling as strategy for reducing concept-drift in never-ending learning environments. Fundamenta Informaticae 124 (1-2): 47–61, 2013. Krishnamurthy, J. and Mitchell, T. M. Which noun phrases denote which concepts? In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. HLT ’11. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 570–580, 2011. 
Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T., Nakashole, N., Platanios, E., Ritter, A., Samadi, M., Settles, B., Wang, R., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., and Welling, J. Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. AAAI’15. AAAI Press, Austin, Texas, pp. 2302–2310, 2015. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12 (Oct): 2825–2830, 2011. Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M., and Soderland, S. Textrunner: Open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. NAACL-Demonstrations ’07. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 25–26, 2007. Zhu, X. pp. 892–897. In C. Sammut and G. I. Webb (Eds.), Semi-Supervised Learning. Springer US, Boston, MA, pp. 892–897, 2010.


A Machine Learning Predictive System to Identify Students in Risk of Dropping Out of College
Gabriel Silva¹, Marcelo Ladeira¹
¹ University of Brasília, Brazil. [email protected]

Abstract. The University of Brasília (UnB) suffers from undergraduate student dropout, which implies negative academic, economic and social consequences. UnB's approach to the problem consists of separating its students into two groups, those that are at risk of dropping out and those that are not, and counseling the students in the former group. This paper describes the development of a predictive system capable of indicating the risk of a student dropping out. This way, UnB could act before it became too late and also act according to the risk presented by the student. For the development of the predictive system, data of students from computer science related courses that entered and left UnB from 2000 to 2016 were used. The data do not contain student identification. Machine learning (ML) algorithms were used to induce models that had their performance analyzed. The ML algorithms applied were ANN, linear regressor, Naive Bayes, random forests and SVR. The machine learning models achieved, in general, good performance; the best performance came from the models induced using linear regression. The results obtained indicate potential in using machine learning to predict the risk of students from computer science related courses dropping out of university. The methodology used can be applied to other courses from UnB or other universities.
Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning
Keywords: machine learning, data mining, student evasion, UnB

1. INTRODUCTION

Students dropping out of Brazilian universities are a significant problem, with academic, social and economic consequences. The University of Brasília (UnB) is not an exception, being significantly affected by the problem¹. This paper describes a possible approach to the problem: the development of a predictive analysis system that estimates the risk of a given student dropping out of college. In case of success, the system would allow the university to take measures with flexibility (according to the risk presented by a student) and in advance (before it becomes too late). The system outputs, for every student, a triple (v1, v2, v3), with positive values that sum to 1. These values indicate, respectively, the chance of the given student graduating, evading or migrating from the course he or she is in. The rest of this article is organized as follows: in the next section, the methods used are thoroughly explained; after that, the results obtained are presented and discussed; finally, conclusions are drawn and ideas for future work are presented.

¹ UnB

had estimated losses of R$ 95.6 million, according to the Brazilian newspaper Correio Braziliense [cor].



2. METHODS

In this section, the methodology used is discussed. First, the data obtained is described, along with the train and test division. After that, feature selection is explained. Next, the division by semester is motivated. Afterwards, the machine learning algorithms used are listed, with their parameters configuration. Finally, the performance metric used is explained. 2.1 Data, Data Division and Train and Test Division The data used are from undergraduate students that entered and left UnB courses from 2000 to 2016. To simplify the analysis, only courses in areas related to computer science were considered. Therefore, the courses considered were: Computer Science (bachelor degree), Computing (licentiate degree), Computer Engineering, Mechatronics Engineering, Network Engineering and Software Engineering. In exploratory data analysis, it was possible to observe that the features varied significantly with the course of the student being considered. Another useful information obtained was that the proportion of students capable of graduating depends on the age the students enter university: students that enter older in university are less likely to graduate. These observations led to the decision of partitioning the data in four distinct databases: —Senior Students: all students with more than 30 years. —Young Students from FT: contain all students that entered in UnB with 30 years or less and course Mechatronics Engineering or Network Engineering. This courses have the distinction of being associated with FT (Faculty of Technology). —Young Students from Computing: contain all students that entered in UnB with 30 years or less and course Computing. Computing is a licentiate degree in UnB, meaning that the students from the course are prepared to be teachers of Computer Science. Another peculiarity is the fact that it is the only night course from the ones we are considering. —Young Students from Computer Science: contain all students that entered in UnB with 30 years or less and course Computer Science (bachelor degree), Computer Engineering or Software Engineering. In order to induce models with machine learning algorithms, data were partitioned in a training set and a test set. As indicated in [da Silva and Adeodato 2012], a realistic division of the data for the problem we are dealing with should be an “Out of Sample” division, in which the training set is composed by the firsts instances and the test set from the last ones. This way, we get a realistic scenario in which the models are induced to be applied in a future time. With this in mind, the training set was composed of students that entered in university from 2000 to 2009 while the test set was composed of students that entered in university from 2010 to 2016. 2.2 Features and Feature Selection After initial research (see [da Silva and Adeodato 2012] or [Kinnunen and Malmi 2006]) on which features should be included for the models to train, the following personal features were selected initially: —Age when student started the course —Course —Entered via quota or not —Race —Sex —Type of secondary school (Private or Public) Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


—Way In In addition to personal features, academic features were, obviously, also considered: —Amount of Credits Done —Average Grades in a Semester —Pass rate on the hardest subject of a semester —Indicator of Condition: UnB classifies students as “in condition” or not “in condition” and this classification (see [man ]) is related to the risk of students dropping out. This boolean variable indicates if the student is “in condition” or not. —Pass Rate, Fail Rate and Drop Rate: indicate, respectively, the percentage of the subjects a given student passes, fails and drops. —Position in relation to fellow students: for a given student S, this feature indicates, from all students of the same year, semester and course of S, how many have higher grades than S. —Rate of Academic Improvement: reason between grades of a student in the current semester by the grades of the same student in the previous one. Feature selection was done afterwards. The features race and type of school were eliminated from further analysis, because of having a significant amount of missing values (more than 40%). To avoid redundancy, a Kendall test (see [Noether 1981]) was applied to check if any two of the attributes had very strong correlation. Because the test indicated that the fail rate and the pass rate were significantly related, a decision was made to consider only the pass rate (this strong correlation was verified for all 4 databases). Moreover, to check if all the features were really necessary, decision trees were employed: features that did not appear as part of a decision tree were not considered in further analysis. Results indicated that, for young students of computing the feature course was irrelevant (which makes sense, since all students from this group have the same courses). For senior students, the features course, quota and drop rate were considered irrelevant. For the other 2 databases, no feature was considered irrelevant. 2.3 The Necessary Semester Division The predictive system, for the business problem here considered, should be capable of calculating the drop out risk for students in the beginning of the course and for students in the end of the course. In relation to that, students academic features change from semester to semester. This indicates that a necessary semester division must be carried out. Thus, the models are induced and evaluated separately semester by semester. Therefore, initially, the models are induced on the train dataset containing features from the 1o semester of the students and are evaluated on the test dataset containing features from the 1o semester of the students. Next, this procedure is repeated for the 2o semester and so on. 2.4 Machine Learning Algorithms The machine learning algorithms used in this research were: ANN, linear regressor, Naive Bayes, random forest, SVR. For more information on this algorithms, we recommend the excellent books [Kelleher et al. 2015] and [Abu-Mostafa et al. 2012]. To establish a baseline, the ZeroR model was considered (this model simply always picks the most common class as the predicted one). The Python programming language was used, combined with the scikit-learn library [Pedregosa et al. 2011] (v. 0.18.1). Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.
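As a rough sketch of the "Out of Sample" split and of the per-semester induction described above (our own illustration, with hypothetical column names such as entry_year, semester and outcome; this is not the authors' actual code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def evaluate_per_semester(df, feature_cols, target_col="outcome"):
    # "Out of Sample" division: train on 2000-2009 entrants, test on 2010-2016 entrants
    train, test = df[df.entry_year <= 2009], df[df.entry_year >= 2010]
    scores = []
    for sem in sorted(df.semester.unique()):
        tr, te = train[train.semester == sem], test[test.semester == sem]
        if tr.empty or te.empty:
            continue
        # one of the algorithms listed above, used here only as an example
        model = RandomForestClassifier().fit(tr[feature_cols], tr[target_col])
        pred = model.predict(te[feature_cols])
        scores.append(f1_score(te[target_col], pred, average="macro"))
    # performance summary: mean of the per-semester F-measures
    return sum(scores) / len(scores) if scores else 0.0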


Table I. ANN's configuration, according to the database
Database                             | Hidden Layer Size | Learning Rate
Senior Students                      | 24                | 0.7
Young Students from FT               | 100               | 0.001
Young Students from Computing        | 36                | 1.0
Young Students from Computer Science | 36                | 0.001

Table II. SVR's configuration, according to the database
Database                             | Kernel Type | Penalty Parameter
Senior Students                      | linear      | 1.0
Young Students from FT               | linear      | 1.0
Young Students from Computing        | linear      | 1.0
Young Students from Computer Science | RBF         | 1.0

Table III. Naive Bayes configuration, according to the database
Database                             | Feature Distribution
Senior Students                      | Gaussian
Young Students from FT               | Gaussian
Young Students from Computing        | Bernoulli
Young Students from Computer Science | Multinomial

Preliminary studies were made to determine the best configurations for the ANN, the Naive Bayes and the SVR parameters. For the other machine learning algorithms, their default configurations were used. The results varied according to the databases and are shown in Tables I, II and III.

2.5 Performance Measures

As is standard in machine learning, the models were induced on the training dataset and had their performance measured on a test dataset with unseen data. This process was done for each of the semesters studied and works as explained next. Each machine learning model generates, for each student in each semester, a triple that indicates the assessed probability of the student graduating, evading or migrating. The highest value of the triple is used as the model prediction. This prediction is then compared to what really happened to the student. The metric used to evaluate the performance of the models was the F-measure. A given model has, for a given database, one F-measure value per semester. To summarize the performance of the model for a database, the mean of these F-measures was calculated. The models were compared to the ZeroR classifier.
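A minimal sketch of this evaluation step (our own illustration; the label encoding below is an assumption, not taken from the paper):

import numpy as np
from sklearn.metrics import f1_score

LABELS = ["graduate", "evade", "migrate"]  # assumed order of the triple (v1, v2, v3)

def predict_from_triples(triples):
    # the highest value of each (v1, v2, v3) triple gives the predicted outcome
    return [LABELS[int(np.argmax(t))] for t in triples]

def mean_f_measure(true_by_semester, triples_by_semester):
    # one F-measure per semester, summarized by the mean
    scores = [f1_score(y, predict_from_triples(t), average="macro")
              for y, t in zip(true_by_semester, triples_by_semester)]
    return float(np.mean(scores))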

3. RESULTS AND DISCUSSION

Table IV. F-measure Mean per Model, for Young Students from FT
ML Model         | F-measure
ANN              | 0.77
Linear Regressor | 0.80
Naive Bayes      | 0.56
Random Forest    | 0.74
SVR              | 0.76
ZeroR            | 0.64

Table V. F-measure Mean per Model, for Young Students from Computing
ML Model         | F-measure
ANN              | 0.85
Linear Regressor | 0.87
Naive Bayes      | 0.76
Random Forest    | 0.86
SVR              | 0.82
ZeroR            | 0.70

Table VI. F-measure Mean per Model, for Young Students from Computer Science
ML Model         | F-measure
ANN              | 0.68
Linear Regressor | 0.77
Naive Bayes      | 0.65
Random Forest    | 0.76
SVR              | 0.70
ZeroR            | 0.60

Table VII. F-measure Mean per Model, for Senior Students
ML Model         | F-measure
ANN              | 0.62
Linear Regressor | 0.75
Naive Bayes      | 0.28
Random Forest    | 0.71
SVR              | 0.80
ZeroR            | 0.61


The mean of the F-measures was calculated for each model and database. The results are shown in Tables IV, V, VI and VII. These results show that, in general, the ML models have better performance than the ZeroR (Naive Bayes being the exception). The bad performance of the Naive Bayes algorithm may be due to the fact that the features are not conditionally independent, a hypothesis necessary for the use of the algorithm. Lastly, the good result obtained by the linear regressor should be highlighted. That learning model obtained the best results in three of the four databases, with a mean value of F-measure close to 0.8. This goes in accordance with machine learning theory, which assures that, generally, linear models are not likely to overfit and are good initial alternatives to a problem [Abu-Mostafa et al. 2012].

4. CONCLUSION

The linear regressor algorithm induced models with good performance in assessing the risk of a student graduating, dropping out or migrating. This result shows the viability of using machine learning to predict the risk of students dropping out of college. The methodology used can be applied to other undergraduate courses from UnB or other universities. A natural sequence for this research would be the implementation of dropout-related actions in UnB based on the risk predicted by the system described here. These actions should be evaluated based on criteria such as the drop-out reduction obtained and the acceptance by university members. Another possible sequence would be testing the system for other courses of UnB or another university.

REFERENCES
http://www.correiobraziliense.com.br/app/noticia/cidades/2015/10/10/interna_cidadesdf,501999/evasoesna-universidade-de-brasilia-causam-prejuizo-de-r-95-mi.shtml. Accessed on July 7, 2015.
http://www.unb2.unb.br/administracao/decanatos/deg/downloads/index/guiacalouro.pdf. Accessed on July 7, 2017.
Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. Learning from data. Vol. 4. AMLBook, New York, NY, USA, 2012.
da Silva, H. R. B. and Adeodato, P. J. L. A data mining approach for preventing undergraduate students retention. In Neural Networks (IJCNN), The 2012 International Joint Conference on. IEEE, pp. 1–8, 2012.
Kelleher, J. D., Mac Namee, B., and D'Arcy, A. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. MIT Press, 2015.
Kinnunen, P. and Malmi, L. Why students drop out CS1 course? In Proceedings of the Second International Workshop on Computing Education Research. ACM, pp. 97–108, 2006.
Noether, G. E. Why Kendall tau? Teaching Statistics 3 (2): 41–43, 1981.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research vol. 12, pp. 2825–2830, 2011.


Classificação Hierárquica e Não Hierárquica de Elementos Transponíveis G. T. Pereira, R. Cerri Universidade Federal de São Carlos, Brazil {gean.pereira,cerri}@ufscar.br Resumo. Em problemas tradicionais de classificação, uma instância em um conjunto de dados é atribuída a apenas uma classe dentre um conjunto normalmente pequeno de classes. Entretanto, existem problemas mais complexos com dezenas e até centenas de classes, conhecidos na literatura como problemas de Classificação Hierárquica, em que uma instância é assinalada não só a uma classe mas também as suas superclasses, arranjadas em uma estrutura hierárquica. Problemas hierárquicos possuem aplicação em uma gama de cenários, como na Bioinformática. Nesse contexto, um tema que vem ganhando atenção é a classificação de Elementos Transponíveis (TEs), que tratam-se de sequências de DNA capazes de se mover dentro do genoma de uma célula, modificando a funcionalidade de seus genes, e que por esse motivo, são de grande importância para a variabilidade genética das espécies. Nesse trabalho a classificação dos TEs é tratada como um problema de classificação hierárquica, onde é proposto um novo método utilizando Algoritmos Genéticos que gera regras de classificação. Trata-se da primeira tentativa na literatura de aplicar Algoritmos Genéticos para induzir regras que classificam TEs. O método proposto foi comparado com outros classificadores, considerando tanto métricas tradicionais quanto hierárquicas. Assim, objetivou-se verificar a capacidade dos classificadores não hierárquicos no contexto hierárquico, além de analisar o desempenho do classificador hierárquico em uma classificação tradicional. Como resultados, o método proposto superou todos os demais nas avaliações hierárquicas, e mesmo não obtendo os melhores resultados nas avaliações convencionais, mostrou resultados promissores nos cenários para os quais não foi originalmente projetado. Ademais, alguns classificadores tradicionais mostraram-se promissores no contexto hierárquico, deixando em aberto as possibilidades de adaptações futuras. Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining; I.2.6 [Artificial Intelligence]: Learning Keywords: Hierarchical Classification, Transposable Elements, Rule Induction, Genetic Algorithms

1. INTRODUÇÃO

Em Aprendizado de Máquina (AM), a maioria dos trabalhos acerca de classificação aborda a Classificação Plana (flat), onde uma instância de um conjunto de dados é associada a uma classe dentre um conjunto normalmente pequeno de classes [Mitchell et al. 1997]. Entretanto, existem problemas mais complexos que envolvem dezenas e até centenas de classes arranjadas de acordo com uma estrutura hierárquica, a qual possui certas restrições. Tal tipo de classificação é conhecida na literatura de AM como Classificação Hierárquica (CH) (no inglês, Hierarchical Classification) [Silla and Freitas 2011]. Uma das áreas de aplicação da CH é a Bioinformática, que como já apontado na literatura, quando usada em conjunto com AM, frequentemente resulta em melhores predições em comparação com outras abordagens [Loureiro et al. 2013]. Nesse contexto, um tema que vem ganhando atenção é o estudo dos chamados Elementos Transponíveis (TEs) (no inglês, Transposable Elements), fragmentos de DNA que se movem dentro do genoma de seus hospedeiros [Biémont 2010]. Conforme pesquisas recentes, os TEs são responsáveis por mutações em diversos organismos, inclusive no genoma humano, o que lhes garantiu a alcunha de grandes responsáveis pela variabilidade genética das espécies [Jurka et al.


2005]. A correta classificação desses elementos traz benefícios para o entendimento de várias espécies, e como se deu a evolução das mesmas, aspectos importantes para o campo da Biologia/Bioinformática. Neste trabalho é proposto um classificador global baseado em Algoritmos Genéticos (AGs) [Holland 1975] que gera regras de CH. Além disso, são realizadas comparações desse método com classificadores bem estabelecidos na literatura, no qual visa-se analisar a capacidade do método hierárquico em predizer classes folhas, além de se verificar o potencial de técnicas flat em CH de TEs. As principais contribuições deste trabalho são: —Apresentação de um novo método de CH baseado na abordagem global que gera regras; —Aplicação de classificadores flat em dados inéditos de TEs estruturados hierarquicamente; —Validação de classificadores flat em CH e comparação de resultados com um método hierárquico. 2.

CLASSIFICAÇÃO HIERÁRQUICA

Em CH uma instância é assinalada não só a uma classe mas também as suas superclasses. Várias outras peculiaridades fazem parte desse tipo de classificação, como as estruturas hierárquicas utilizadas, sendo comumente Árvores ou Grafos Acíclicos Direcionados. O que as difere é o número de caminhos que um classificador pode explorar na estrutura. Outro ponto diz respeito as diferentes abordagens existentes na CH, chamadas de global e local [Silla and Freitas 2010]. Na abordagem global um único classificador é induzido para lidar com todas as classes da hierarquia e a classificação de uma nova instância acontece em um único [Vens et al. 2008]. Já a abordagem local possui uma modularidade de treinamento em que usa de informações locais das classes para comumente, classificar uma nova instância explorando a hierarquia nível a nível [Silla and Freitas 2010]. Assim como adota estruturas de dados e abordagens peculiares, problemas de CH requerem métricas de avaliação específicas para que uma análise adequada dos classificadores seja feita. Isso se dá pelo fato de que as medidas tradicionais usadas na classificação flat não consideram as predições feitas em vários níveis de uma hierarquia. Duas medidas populares em trabalhos de CH, oriundas de adaptações em métricas tradicionais, são a Precisão Hierárquica (hP) e a Revocação Hierárquica (hR) propostas por [Kiritchenko et al. 2006]. Ainda, existe uma métrica que combina as citadas anteriormente, chamada Medida-F Hierárquica (hF). Essas medidas são apresentadas abaixo, onde Ci e Zi são, respectivamente, o conjunto de classes verdadeiras e preditas para uma instância i.

hP = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |Z_i|} \qquad hR = \frac{\sum_i |Z_i \cap C_i|}{\sum_i |C_i|} \qquad hF = \frac{2 \cdot hP \cdot hR}{hP + hR} \qquad (1)
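Um esboço simples dessas medidas em Python (implementação nossa, apenas ilustrativa), assumindo que as classes verdadeiras e preditas de cada instância são representadas como conjuntos que incluem as superclasses:

def medidas_hierarquicas(verdadeiras, preditas):
    # verdadeiras[i] e preditas[i]: conjuntos Ci e Zi da i-ésima instância, como na Equação (1)
    intersec = sum(len(z & c) for z, c in zip(preditas, verdadeiras))
    hp = intersec / sum(len(z) for z in preditas)
    hr = intersec / sum(len(c) for c in verdadeiras)
    hf = 2 * hp * hr / (hp + hr) if (hp + hr) > 0 else 0.0
    return hp, hr, hf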

3. TRABALHOS RELACIONADOS

Em [Ghazi et al. 2010], a tarefa de CH é explorada no contexto de classificação automática de textos através de emoções expressadas por escrito. Os autores propõem um método que organiza hierarquicamente a neutralidade, a polaridade e as emoções presentes nos textos. Tal método é testado juntamente com um método flat em dois conjuntos de dados onde é verificado que o método hierárquico supera a abordagem flat comparada. Outra análise interessante feita pelos autores é que o alto desbalanceamento presente nos conjuntos de dados usados e que tem grande impacto no desempenho dos classificadores, é suavizado em abordagens hierárquicas. Em [Zimek et al. 2010], são realizadas comparações com classificadores multi-classe que exploram informações hierárquicas e com classificadores flat, entre esses SVMs, Árvores de Decisão e ensembles. Os dados utilizados tem relação com classificação de enzimas de proteínas e dados sintéticos. Apesar da natureza hierárquica dos problemas, os autores concluem que nem sempre a exploração de informações Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


hierárquicas resulta em melhores resultados. Segundo eles, a exploração das hierarquias de classes melhora as predições no caso de dados sintéticos, diferentemente da classificação de proteínas. 4.

CLASSIFICAÇÃO HIERÁRQUICA COM UM ALGORITMO GENÉTICO

Nessa seção é apresentado o HC-GA (do inglês, Hierarchical Classification with Genetic Algorithm), um método global de CH que gera um conjunto de regras de classificação hierárquica com bom desempenho preditivo e interpretabilidade. Seu processo de evolução de regras é detalhado no Algoritmo 1. O HC-GA implementa uma cobertura sequencial de instâncias para gerar e evoluir regras de classificação. Durante o processo evolutivo das regras, as instâncias cobertas são removidas do conjunto de treinamento, permitindo a geração de novas regras que irão cobrir as instâncias remanescentes.

Algoritmo 1: Procedimento do HC-GA.

 1  procedimento HC-GA(pc, pm, P)
    Entrada: Dados de treinamento D, número de gerações G, tamanho da população p, número máximo de instâncias cobertas por regra maxCov, número mínimo de instâncias cobertas por regra minCov, número máximo de instâncias não cobertas maxUncov, taxa de cruzamento cr, taxa de mutação mr, tamanho do torneio t, número de indivíduos selecionados por elitismo e, probabilidade de usar um termo numa regra pt.
    Saída: Conjunto de Regras evoluídas
 2  início
 3      Regras ← ∅
 4      enquanto (|D| > maxUncov) faça
 5          populacaoInicial ← geraPopulacao(D, p, pt)
 6          calculaFitness(populacaoInicial, D)
 7          populacaoAtual ← populacaoInicial
 8          melhorRegra ← melhor regra da populacaoAtual segundo o fitness
 9          j ← G
10          enquanto j > 0 OU regraConverge() faça
11              novaPopulacao ← ∅
12              novaPopulacao ← novaPopulacao ∪ elitismo(populacaoAtual, e)
13              pais ← selecaoPorTorneio(populacaoAtual, t, e, p)
14              filhos ← cruzamentoUniforme(pais, cr)
15              filhos ← mutacao(filhos, mr, pt)
16              novaPopulacao ← novaPopulacao ∪ filhos
17              novaPopulacao ← buscaLocal(novaPopulacao, minCov, maxCov)
18              populacaoAtual ← novaPopulacao
19              calculaFitness(populacaoAtual, D)
20              melhorRegra ← obtemMelhorRegra(populacaoAtual, melhorRegra)
21              j ← j − 1
22          Regras ← Regras ∪ melhorRegra
23          Remove instâncias de D cobertas pela melhorRegra

4.1 Representação de indivíduos Um indivíduo no HC-GA é um vetor codificado de tamanho fixo (definido dinamicamente) contendo valores reais que representam os antecedentes de uma regra de classificação. Cada antecedente é um teste sobre um atributo do conjunto de dados, sendo os seus possíveis componentes uma FLAG, um operador (OP) e um índice/valor de atributo (valores ∆). Logo, cada conjunto de quatro posições do indivíduo representa um antecedente, sendo este uma 4-tupla{FLAG, OP, ∆1 , ∆2 }. O gene FLAG do vetor codificado pode receber o valor 0 ou 1, indicando respectivamente a ausência ou presença do teste correspondente em um antecedente de regra (ativado ou não). Isso permite que as regras tenham diferentes números de testes. Já o gene OP define, de forma aleatória, um dos operadores possíveis para um determinado atributo testado (numérico/ordinal ou categórico). Por sua vez, os genes ∆1 e ∆2 recebem os valores de números reais usados como condições de teste. O HC-GA consegue trabalhar com atributos categóricos, numéricos e ordinais. Como os dados adotados contêm apenas atributos numéricos, o método utilizou apenas os operadores ≤ e ≥. Caso o operador ≤ for usado, ∆1 é definido como 0 e ∆2 recebe o valor do atributo a ser testado, e no caso Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.


the operator ≥ is chosen, the opposite assignment is made. It is also possible to check whether a numeric attribute value lies within a given interval, e.g. ∆1 ≤ Ak ≤ ∆2, where the ∆ values are the lower and upper bounds for numeric attributes in a test. In this case, the ∆ values are chosen randomly so that the attribute value in the instance satisfies the test condition. It is worth emphasizing that all possible operators are indexed in advance with fixed indices, so that a test can easily be assembled from the corresponding indices and operators.

4.2 Generating a population

In HC-GA the population is created by a seeding procedure in which a training instance is randomly selected and its attributes are used to create an individual. Each attribute of the instance has a probability pt of being used in the antecedent of the individual being generated. Note that the same probability pt is used to define the value of the FLAG gene of the 4-tuple corresponding to attribute Ai. A low value of pt is usually adopted both for seeding and for the mutation of FLAG genes, since a high value would result in rules with many active tests that would cover only a few instances, or only the instance used as seed. This seeding procedure guarantees that each individual covers at least one training instance, and it is repeated until a population of the desired size is generated. An instance is said to be covered by a rule if all of the rule's active tests are satisfied by the attribute values of the instance.

4.3 Fitness computation

A variety of fitness functions were implemented and tested in HC-GA, but the one that obtained the best results was a weighted combination of two measures: the Coverage Percentage (PC) of a rule and its Variance Gain (VG) [Vens et al. 2008]. PC is obtained as PC(r) = |Sr| / |S|, where Sr is the set of instances covered by rule r and S is the full set of instances. The Variance Gain is presented in Equation 2.

$$VG(r, S) = var(S) - \frac{|S_r|}{|S|} \times var(S_r) - \frac{|S_{\neg r}|}{|S|} \times var(S_{\neg r}) \qquad (2)$$

According to this equation, the training set S is split into two subsets: the set Sr of instances covered by rule r and the set S¬r of instances not covered by it. The Variance Gain of rule r is computed with respect to the set S. This computation also involves the Variance (var) of the sets of instances covered and not covered by rule r, presented in Equation 3.

$$var(S) = \frac{\sum_{k=1}^{|S|} euclideanDistance(v_k, \overline{v})^{2}}{|S|} \qquad (3)$$

This Variance is defined as the mean squared distance between the class vector of each instance and the mean class vector of the considered set of instances (S, Sr or S¬r). For that, the weighted Euclidean distance presented in Equation 4 is used.

$$euclideanDistance(v_1, v_2) = \sqrt{\sum_{i=1}^{|C|} w_i \times (v_{1,i} - v_{2,i})^2} \qquad (4)$$
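To make Equations 2 to 4 concrete, the following minimal Python sketch computes the coverage percentage and the variance gain of a rule. It is only an illustration under our own naming: the paper states that PC and VG are combined in a weighted fashion, so the mixing weight alpha below is a hypothetical parameter, not a value taken from the paper.

    import math

    def weighted_euclidean(v1, v2, w):
        # Equation 4: weighted Euclidean distance between two class vectors.
        return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, v1, v2)))

    def mean_class_vector(class_vectors):
        # Component-wise mean of a list of class vectors.
        n = len(class_vectors)
        return [sum(col) / n for col in zip(*class_vectors)]

    def variance(class_vectors, w):
        # Equation 3: mean squared weighted distance to the mean class vector.
        if not class_vectors:
            return 0.0
        mean = mean_class_vector(class_vectors)
        return sum(weighted_euclidean(v, mean, w) ** 2 for v in class_vectors) / len(class_vectors)

    def fitness(rule_covers, instances, class_vectors, w, alpha=0.5):
        # rule_covers(x) returns True when the rule covers instance x.
        # alpha is a hypothetical mixing weight (the exact weighting is not given in the paper).
        covered = [v for x, v in zip(instances, class_vectors) if rule_covers(x)]
        uncovered = [v for x, v in zip(instances, class_vectors) if not rule_covers(x)]
        pc = len(covered) / len(instances)                     # PC(r) = |Sr| / |S|
        vg = (variance(class_vectors, w)                       # Equation 2
              - (len(covered) / len(instances)) * variance(covered, w)
              - (len(uncovered) / len(instances)) * variance(uncovered, w))
        return alpha * pc + (1 - alpha) * vg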

4.4 Evolutionary process

After a population of rules has been generated (Algorithm 1, line 5) and its fitness computed (Algorithm 1, line 6), the evolutionary process starts. First, the best e rules of the current population are


saved (Algorithm 1, line 12). Next, a set of p − e parent rules (where p is the population size and e the number of individuals selected by elitism) is selected using Tournament Selection [Wetzel 1983] (Algorithm 1, line 13), which enables the crossover process: the parent rules undergo a uniform crossover (Algorithm 1, line 14) to generate the offspring rules. In addition, a specialized crossover is also applied, aiming to combine rules that classify similar instances in the Euclidean space. This specialized crossover operator considers the Weighted Euclidean Distance between the mean class vectors of the rules, denoted vr, where r is a rule; this vector has size |C|, C being the set of classes of the hierarchy. The i-th component of vr is obtained as vr,i = |Sr,i| / |Sr|, where Sr,i is the set of all training instances covered by r that are classified in class i, while Sr is the set of all training instances covered by r. Hence, each position vr,i holds the proportion of instances covered by r that are classified in class i. Once a new generation of offspring rules is obtained, the mutation operator is applied to a percentage mr of the individuals (Algorithm 1, line 15). Each individual may have its FLAG genes mutated (having a test included in or removed from the rule), or undergo a generalization/restriction operation. The generalization/restriction operations modify the ∆ values in the tests of a rule in order to make it more general (incrementing them) or more specific (decrementing them). Each selected individual has a 50% chance of suffering a FLAG mutation and a 50% chance of suffering a generalization or restriction. If the FLAG mutation occurs, the FLAGs of an individual's 4-tuples are inverted (0 to 1, or the reverse) according to a probability pt. After all these procedures, a local search operator is applied (Algorithm 1, line 17). It aims to ensure that the generated rules cover between a minimum and a maximum number of instances, making them neither too specific nor too general. The minimum (minCov) and maximum (maxCov) numbers of instances covered by a rule are parameters specified by the user, and a rule that does not satisfy them is discarded. At the end of the generation of a new population, the best rule obtained is saved (Algorithm 1, line 20). Two factors are considered for the selection of the best rule: the fitness of the rule and whether it covers the minimum number of instances specified by the parameter minCov. Initially, when the first generation of individuals is created, the best rule is saved taking only its fitness into account. From the second generation onwards, all rules are ranked according to their fitness values and compared with the best rule of the previous generation. If a new rule presents a fitness value higher than that of the best rule of the previous generation, and also covers the minimum number of instances minCov, it automatically becomes the current best rule. This evolutionary process (Algorithm 1, lines 10 to 21) is executed until a maximum number of generations is reached or until the population converges. The population converges when the same rule remains the best rule after a specific number of generations.
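The mean class vector vr used by the specialized crossover (and, as Section 4.5 describes, also used as the rule consequent) can be sketched as below. This is an illustrative implementation with our own names, assuming each covered instance carries a set of class labels.

    def rule_mean_class_vector(covered_label_sets, classes):
        # v_{r,i} = |S_{r,i}| / |S_r|: proportion of covered instances assigned to class i.
        # covered_label_sets: list of class-label sets of the training instances covered by the rule.
        # classes: ordered list of the hierarchy classes C.
        n_covered = len(covered_label_sets)
        if n_covered == 0:
            return [0.0] * len(classes)
        return [sum(1 for labels in covered_label_sets if c in labels) / n_covered
                for c in classes]

    # Example: 3 covered instances over the classes ["1", "1/1", "1/2", "2"]
    labels = [{"1", "1/1"}, {"1", "1/1"}, {"1", "1/2"}]
    print(rule_mean_class_vector(labels, ["1", "1/1", "1/2", "2"]))
    # -> [1.0, 0.666..., 0.333..., 0.0]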
Finally, at the end of the evolutionary process the best rule is saved in a final rule set (Algorithm 1, line 22) and the instances covered by this rule are removed from the training set (Algorithm 1, line 23). The evolutionary process is then restarted with the creation of a new set of initial rules. This sequential covering procedure is repeated until all, or almost all (parameter maxUncov of Algorithm 1), training instances are covered. If uncovered instances remain in the training set at the end of the process, they are classified using a default rule. This rule simply classifies the remaining instances with the mean vector of all classes of the training set, given by:

$$\overline{v} = \frac{\sum_{i=1}^{|S|} v_i}{|S|}$$

The mean vector v̄ over all training classes is obtained using the class vectors of all training instances. Each instance i of the training set S has its classes represented in a vector vi with |C| positions, where C is the number of classes of the problem. If an instance i belongs to a class j, position vij of its vector receives the value 1; otherwise, it


receives 0. Hence, each position of the vector corresponds to a class, and the real value it holds can be interpreted as the probability, in the interval [0,1], of the instance belonging to that class.

4.5 Generating the consequent

Once the antecedents of a rule r have been created, its mean class vector can be computed. The mean class vector vr of a rule r is obtained through the equation vr,i = |Sr,i| / |Sr|, and it can be interpreted as the consequent of r. This vector has size |C|, C being the number of classes of the hierarchy. In this equation, Sr corresponds to the set of all training examples covered by rule r, while Sr,i represents the set of all training examples covered by rule r that are assigned to class i. Hence, each position vr,i holds the proportion of examples covered by r that are classified in class i. In the end, the rules of HC-GA are formed by joining the generated antecedents with AND clauses, together with their respective consequents.

5. EXPERIMENTS AND DISCUSSION OF RESULTS

This section presents the results obtained by traditional machine learning (ML) classifiers and by the hierarchical method HC-GA. Two TE datasets extracted from two public databases were used: (1) PGSB (http://pgsb.helmholtz-muenchen.de/plant/), which contains plant repeat sequences, and (2) REPBASE (http://www.girinst.org/repbase/), which stores DNA repeat sequences from different eukaryotic organisms. As features, k-mers of sizes 2, 3 and 4 were used, as commonly employed in Bioinformatics tools [Melsted and Pritchard 2011]. Moreover, these data were structured according to the hierarchical taxonomy of [Wicker et al. 2007] and formatted so that ML algorithms can process them. The evaluations of the classifiers are presented next.

5.1 Evaluation of the Classifiers with non-Hierarchical Metrics

Table I presents the results obtained by classifiers evaluated according to the metrics of Precision (P), Recall (R) and F-measure (F). We sought to include at least one example of classifier from each classification paradigm, such as C4.5 (symbolic), SVM (probabilistic) and KNN (instance-based), in addition to HC-GA, which is of a stochastic exploratory nature.

Table I. Comparison of Results considering non-Hierarchical Metrics

C4.5

NB

KNN

SVM

MLP

HC-GA

0.68 ± 0.022 0.69 ± 0.012 0.68 ± 0.017

0.41 ± 0.141 0.30 ± 0.110 0.32 ± 0.072

0.51 ± 0.079 0.49 ± 0.075 0.50 ± 0.077

0.24 ± 0.086 0.22 ± 0.056 0.22 ± 0.051

0.24 ± 0.036 0.22 ± 0.031 0.23 ± 0.033

PGSB P R F

0.66 ± 0.011 0.66 ± 0.011 0.66 ± 0.011

0.45 ± 0.041 0.20 ± 0.009 0.28 ± 0.016

0.80 ± 0.009 0.80 ± 0.009 0.80 ± 0.009

P R F

0.46 ± 0.008 0.46 ± 0.007 0.46 ± 0.007

0.38 ± 0.111 0.11 ± 0.003 0.16 ± 0.014

0.69 ± 0.006 0.68 ± 0.007 0.69 ± 0.006

REPBASE 0.56 ± 0.013 0.49 ± 0.004 0.52 ± 0.008

This experiment was carried out in order to answer two research questions: (i) how good the hierarchical classifier HC-GA is at predicting the leaf classes of TEs; and (ii) which flat classifiers present the highest accuracy at the leaf classes of TEs. As can be observed, the KNN classifier obtained the best average results, being superior to the others in the three measures


considered and in both datasets, besides presenting a low standard deviation. On the other hand, the hierarchical method HC-GA did not achieve results close to the best classifiers. We therefore conclude that HC-GA still needs adjustments in order to increase its accuracy at the deeper levels of the class hierarchy, since these often provide relevant information for the domain [Silla and Freitas 2010]. Moreover, this experiment revealed potential in some traditional classifiers, which may be good options for HC if they are adapted.

5.2 Evaluation of the Classifiers with Hierarchical Metrics

Table II presents the results considering the measures of Hierarchical Precision (hP), Hierarchical Recall (hR) and Hierarchical F-measure (hF). Here we sought to verify, alongside HC-GA, whether any flat method presents good results in HC even without being designed for it.

Table II. Comparison of Results considering Hierarchical Metrics

C4.5

NB

KNN

SVM

MLP

HC-GA

0.28 ± 0.002 0.70 ± 0.006 0.40 ± 0.003

0.73 ± 0.188 0.16 ± 0.117 0.24 ± 0.145

0.80 ± 0.044 0.67 ± 0.040 0.72 ± 0.033

0.05 ± 0.009 0.42 ± 0.093 0.10 ± 0.016

0.64 ± 0.066 0.51 ± 0.033 0.56 ± 0.044

PGSB hP hR hF

0.73 ± 0.013 0.38 ± 0.007 0.50 ± 0.009

0.41 ± 0.012 0.21 ± 0.006 0.27 ± 0.008

0.14 ± 0.001 0.90 ± 0.004 0.24 ± 0.001

hP hR hF

0.08 ± 0.001 0.94 ± 0.005 0.14 ± 0.001

0.11 ± 0.003 0.04 ± 0.001 0.06 ± 0.002

0.07 ± 0.001 0.89 ± 0.003 0.14 ± 0.001

REPBASE 0.11 ± 0.001 0.27 ± 0.001 0.16 ± 0.001

In this comparison, the results under hierarchical measures show a clear superiority of the proposed hierarchical method. HC-GA was superior to the other methods in the hP and hF measures in both datasets; however, some flat methods presented better results for the hR measure. This is expected, since Precision and Recall tend to be inversely proportional. As the flat methods do not consider the class hierarchy, they end up assigning instances to classes that are not necessarily their most specific classes, but are part of them, and they often do not complete the hierarchical class path, which increases Recall and decreases Precision. For example, a node classified at 2/1 (where "/" separates the levels) that is assigned to 2, 2/1, 2/1/1 and 2/1/1/1 will score in Recall but will be penalized in Precision. One point that may have been relevant for HC-GA losing in hR is the specificity of the generated rules. When a rule is very specific, it tends to cover fewer instances, which usually increases Precision and decreases Recall. In any case, even though HC-GA obtained lower values in hR, this does not mean that its predictive performance is worse, because the most important measure here is hF, which combines the other two. Some remarks about the flat classifiers are also pertinent. This experiment showed that, at least for the PGSB dataset, the NB classifier had a similar predictive performance in the two evaluations, with P = 0.45, R = 0.20 and F = 0.28, and later hP = 0.41, hR = 0.21 and hF = 0.27, which indicates a certain robustness to the problem. Furthermore, the C4.5 and KNN methods presented high values for hR. This may have happened due to the class imbalance in PGSB and REPBASE, leading these methods to assign the majority classes to the instances, which produces the Precision versus Recall behavior already mentioned.
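For concreteness, the sketch below computes hierarchical precision, recall and F-measure over ancestor-augmented class sets, in the style of [Kiritchenko et al. 2006] (listed in the references); the paper does not spell out its exact hP/hR formulas, so this is an assumption of the standard variant, and the path-like labels are only illustrative.

    def ancestors(label):
        # For a path-like label such as "2/1/1", return it together with all its ancestors.
        parts = label.split("/")
        return {"/".join(parts[:i]) for i in range(1, len(parts) + 1)}

    def augment(labels):
        out = set()
        for l in labels:
            out |= ancestors(l)
        return out

    def hierarchical_prf(predicted, true):
        # hP, hR and hF for one instance, over ancestor-augmented label sets.
        P = augment(predicted)
        T = augment(true)
        hp = len(P & T) / len(P) if P else 0.0
        hr = len(P & T) / len(T) if T else 0.0
        hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
        return hp, hr, hf

    # The example from the text: true class 2/1, prediction goes all the way down to 2/1/1/1.
    print(hierarchical_prf({"2/1/1/1"}, {"2/1"}))  # hR = 1.0, but hP is penalized (0.5)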

6. CONCLUSIONS AND FUTURE WORK

This work introduced HC-GA, a new global HC method capable of generating classification rules with good accuracy and interpretability, which, together with flat classifiers, was applied to the


classification of TEs. Three main objectives were defined: (i) to present a new global classifier able to handle an entire class hierarchy and to induce an interpretable model with good predictive performance; (ii) to verify the ability of HC-GA to predict classes at deeper levels of the hierarchy; and (iii) to evaluate the performance of traditional classifiers on the HC problem of TEs, comparing them with the proposed method. As verified in the experiments, flat classifiers do not perform well in HC tasks unless they are adapted for them. In the hierarchical comparisons carried out with the HC-GA method, it was superior to all the other methods considered, in both datasets. In addition, in the evaluation with non-hierarchical metrics it was possible to observe that, in general, the traditional classifiers performed well, with emphasis on the KNN method. We believe these experiments were valid to verify both the applicability of non-adapted flat classifiers to HC problems and their ability to predict leaf classes in TE hierarchies, and they also allowed us to identify problems in HC-GA regarding predictions at leaf classes. As future work, we intend to adapt some flat classifiers to operate hierarchically, in order to verify the gains of such adaptation, and to implement mechanisms to optimize the predictions of HC-GA at leaf classes.

ACKNOWLEDGMENTS

We thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the Fundação de Amparo à Pesquisa do Estado de São Paulo (grants 15/14300-1 and 16/50457-5).

REFERENCES

Biémont, C. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics 186 (4): 1085–1093, 2010.
Ghazi, D., Inkpen, D., and Szpakowicz, S. Hierarchical versus flat classification of emotions in text. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text. Association for Computational Linguistics, pp. 140–146, 2010.
Holland, J. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 110 (1-4): 462–467, 2005.
Kiritchenko, S., Matwin, S., Nock, R., and Famili, A. F. Learning and evaluation in the presence of class hierarchies: application to text categorization. In Conference of the Canadian Society for Computational Studies of Intelligence. Springer, pp. 395–406, 2006.
Loureiro, T., Camacho, R., Vieira, J., and Fonseca, N. A. Boosting the detection of transposable elements using machine learning. In 7th International Conference on Practical Applications of Computational Biology & Bioinformatics. Springer, pp. 85–91, 2013.
Melsted, P. and Pritchard, J. K. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12 (1): 333, 2011.
Mitchell, T. M. Machine Learning. McGraw-Hill, 1997.
Silla, C. N. and Freitas, A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery vol. 22, pp. 31–72, 2010.
Silla, C. N., Jr. and Freitas, A. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22 (1-2): 31–72, 2011.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Blockeel, H. Decision trees for hierarchical multi-label classification. Machine Learning 73 (2): 185–214, 2008.
Wetzel, A. Evaluation of the effectiveness of genetic algorithms in combinatorial optimization, 1983.
Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J. L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., et al. A unified classification system for eukaryotic transposable elements. Nature Reviews Genetics 8 (12): 973–982, 2007.
Zimek, A., Buchwald, F., Frank, E., and Kramer, S. A study of hierarchical and flat classification of proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7 (3): 563–571, 2010.


Using graph-based centrality measures for sentiment analysis

G. N. Vilarinho, M. T. Machado, E. E. S. Ruiz
Universidade de São Paulo, Brazil
[email protected], [email protected], [email protected]

Abstract. Recent studies have shown that natural language processing and graph theory are intimately connected disciplines. This article presents some early experiments to investigate whether sentiment analysis (SA) can be performed using graph-based centrality measures. The experiments are performed on a dataset of movie reviews. The sequence of words in a review whose sentiment polarity is to be evaluated is used to build a subgraph. This subgraph is then compared, in terms of the importance of its nodes according to graph centrality measures, against two other large graphs, a positive and a negative graph, respectively representing the positive and negative review classes. Experimental results show that graph-theoretical measures are encouraging indexes towards SA classification.

Categories and Subject Descriptors: E.1 [Data]: Graphs and networks; I.2.7 [Artificial Intelligence]: Natural Language Processing; I.5.4 [Artificial Intelligence]: Text processing

Keywords: sentiment analysis, complex networks, centrality measures, text processing, natural language processing

1. INTRODUCTION

Sentiment analysis (SA) or opinion mining (OM) is the computational approach for the analysis of sentiments, attitudes and emotions expressed in free text about a certain subject [Liu 2012]. More often, SA is related to the identification and analysis of the sentiment expressed in a text with the purpose of classifying this text as positive, negative or neutral [Medhat et al. 2014]. With the advent of Web 2.0 there has been an increased amount of user-generated text, most notably on social networks. This phenomenon may be interpreted as a high-dimension set of data overloaded with opinions, some of them expressed in a direct way. These opinions usually reveal what the web user feels about other people, products, movies, places, sports, services, events and others [Ravi and Ravi 2015]. This may explain why, for more than 15 years, SA has been a major topic in Artificial Intelligence and Natural Language Processing (NLP), to cite only these two computer science areas. Among the various SA applications there are plenty able to anticipate general elections, movie box-office results, to foresee epidemics and stock market movements, and to detect spam, credit card fraud and others [Medhat et al. 2014]. One may group these SA applications under three distinctive approaches: a) the lexical approaches, which work based on pre-classified terms from a certain language; b) the ones supported by machine learning (ML) methods, which, although they may be very efficient, usually demand a training stage based on a previous human classification procedure; and c) the hybrid schemes, which encompass the last two. Recently, graph-based approaches have emerged as a fourth action plan towards NLP applications [Scheible et al. 2010; Aisopos et al. 2012; Gao et al. 2015; Giachanou and Crestani 2016], not neglecting their first application, two decades ago, to produce a sentiment lexicon in English [Hatzivassiloglou and McKeown 1997]. While the ML and lexicon-based approaches are heavily dependent on the language, graph theory offers a satisfactory structural sentence representation that
(The authors thank the research funding agency CAPES for the scholarship granted to G. N. Vilarinho.)


incorporates sentiments [Hatzivassiloglou and McKeown 1997]. Lately, Violos and colleagues [Violos et al. 2016] reported the Word-Graph Sentiment Analysis Method (WSAM), a graph-based method designed to identify the sentiment expressed in microblog posts using the sequence of the words in a post. This paper investigates the viability of graph-based centrality measures to classify sentiment polarity in texts, in other words, to classify the sentiment expressed as either positive or negative. This paper differs from the latter work [Violos et al. 2016] in that it addresses larger term graphs than those formed from Twitter microblogs, which produce graphs from terms expressed within a 140-character window. Also, due to the dimensions of the graphs derived from the long movie review texts used, the proposed centrality measures are used instead of the Maximum Common Subgraph-based similarity metric adopted by Violos. This early experiment is part of an ongoing broader investigation on the use of graph-theoretical approaches for sentiment analysis.

2. DATA AND METHODS

The movie reviews used in this experiment are from the Large Movie Review Dataset (LMRD) [Maas et al. 2011] (available at http://ai.stanford.edu/~amaas/data/sentiment/). The LMRD is a dataset for binary sentiment classification containing 50,000 highly polar movie reviews split into two sets, a training set and a test set, containing 25,000 reviews each. Part of the LMRD training set was used to construct two large word-graphs, here called the N and the P learning graphs, respectively the word-graph composed from negative movie reviews and the one composed from positive movie reviews. Both graphs are the training sets of the experiments reported here. Likewise, the test set, consisting of 200 graphs, was also built from LMRD movie reviews previously classified as having either positive or negative polarity.

Prior to graph construction, all the reviews were normalized. This preprocessing phase consisted of the removal of all HTML tags and punctuation marks, followed by Porter stemming and the expansion of contractions, such as "didn't" to "did not" and "I'll" to "I will".

The classification method is explained in three steps:

Step 1. The construction of the learning word-graphs is driven by the number of words in a word frame, or word window. Considering, for instance, a frame size of two words, e.g. w1 w2, two vertices representing the words w1 and w2 are joined by an edge if the second word, w2, immediately succeeds the first word, w1. The edges are directed and unweighted, thus capturing the sequence of the words appearing in a document. Considering two positive review word-graphs, G1 and G2, the learning graph is built as follows: let G1 have a vertex v1 linked to vertices v2 and v3, while in G2 the same vertex v1 is linked to v2 and v4. The resulting graph G1 ∪ G2 will have vertex v1 connected to v2, v3 and v4. The graph N was built using the negative reviews, while the P graph was built using the positive reviews. These two graphs stand as reference graphs for the two classes of reviews, the positive and the negative class.

Step 2. To classify the sentiment expressed in reviews from the test dataset, each review is temporarily added to both word-graphs, N and P. The addition of a new graph to each one of the learning graphs may change the latter's structure, either if a new vertex is added or if a new edge is needed to link words previously not connected in the reference graph. Our hypothesis is that once a new vertex, or a new edge, is added to the graph, the structural change emerging from this addition can be measured. Our belief is that a positive review added to the positive class word-graph P will cause less structural change than if it were added to the negative class word-graph N.

Step 3. The vertices belonging to the test review word-graph that was added to the learning graphs have their centrality measures computed. Consider the word-graph Gt representing a single review from


the test set. Let wti denote the nodes (words) of Gt. Centrality measures are computed for each wti when Gt is added to either the N graph or the P graph. A similarity measure based on a voting scheme is used to evaluate which of the learning graphs, N or P, Gt is more similar to.

In graph theory, centrality indices measure the importance of each node in a graph. Our proposal is to use centrality indices to measure the eventual structural change caused by the addition of a new review to a graph. Various centrality measures are known, and among these we have used the following: node degree centrality, node eigenvector centrality, PageRank centrality, node betweenness, edge betweenness, Katz centrality and assortativity centrality. The node degree centrality refers to the number of neighbor nodes a node is attached to; the degree centrality of a node v is the number of nodes that v is connected to. The eigenvector centrality is a recursive measure where the centrality of each vertex is proportional to the sum of the centralities of its neighbors. Briefly, PageRank can be seen as a voting system, by all other nodes in the graph, on how important a node is: an edge to a node counts as a vote; no edge, no vote. Betweenness is a centrality measure based on the shortest paths between two nodes or two edges. Considering nodes p and q, this measure counts the number of shortest paths between these two nodes that go through a node v. Edge betweenness operates similarly to betweenness, but counts the number of shortest paths between nodes going through an edge e. The Katz centrality is a metric that accounts for the importance of a node considering the paths of different lengths that start at a node v and lead to other network nodes. Mihalcea and Radev explain all these centrality measures in detail in their book [Mihalcea and Radev 2011]. While all the previous measures are local measures taken at each node of the inserted test review word-graph, the assortativity centrality is a global measure that is often implemented as a correlation between two nodes. Among the various methods that might be used to capture such correlation, the Pearson correlation coefficient of degree between pairs of linked nodes was the method adopted for this work.
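The minimal sketch below illustrates Steps 1 to 3 with a two-word frame and degree centrality only. The decision rule shown (assign the class whose learning graph changes the least when the review graph is added) is a simplified assumption standing in for the authors' voting scheme, which the paper does not spell out; all names are ours.

    from collections import defaultdict

    def word_graph(tokens):
        # Directed, unweighted word graph for a frame of two words:
        # an edge w1 -> w2 is created when w2 immediately follows w1.
        g = defaultdict(set)
        for w1, w2 in zip(tokens, tokens[1:]):
            g[w1].add(w2)
            g.setdefault(w2, set())
        return g

    def merge(g1, g2):
        # Union of two word graphs (Step 1): shared vertices accumulate their out-edges.
        merged = defaultdict(set)
        for g in (g1, g2):
            for u, vs in g.items():
                merged[u] |= vs
        return merged

    def degree_centrality(g):
        # (in-degree + out-degree) / (number of other nodes).
        indeg = defaultdict(int)
        for u, vs in g.items():
            for v in vs:
                indeg[v] += 1
        n = max(len(g) - 1, 1)
        return {u: (len(vs) + indeg[u]) / n for u, vs in g.items()}

    def classify(review_tokens, p_graph, n_graph):
        # Simplified rule: the class whose graph shows the smallest total centrality
        # change over the review's words, after insertion, wins.
        review = word_graph(review_tokens)
        change = {}
        for label, ref in (("positive", p_graph), ("negative", n_graph)):
            before = degree_centrality(ref)
            after = degree_centrality(merge(ref, review))
            change[label] = sum(abs(after[w] - before.get(w, 0.0)) for w in review)
        return min(change, key=change.get)

    # Toy usage with two tiny learning graphs and one test review.
    P = word_graph("great movie great acting".split())
    N = word_graph("boring movie bad acting".split())
    print(classify("great acting".split(), P, N))  # -> "positive"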

3. PRELIMINARY RESULTS AND DISCUSSION

Our experiment is based on the addition of a word-graph review to the pair of reference graphs, N and P, respectively the negative and the positive large review graphs. Following the graph addition, the eventual structural change in both reference graphs is measured. A pair of reference learning graphs, N and P, was constructed first from a collection of 100 reviews and later from 300 reviews. Table I below shows some characteristics of these two pairs of word-graphs. The negative and the positive graphs are structurally comparable, even considering that the negative graph is slightly larger in the number of nodes when compared with the positive graph.

Table I. Characteristics of the positive and negative review graphs, P and N.

             100 reviews           300 reviews
             P         N           P         N
    Nodes    2,145     2,208       2,929     3,123
    Edges    7,815     7,716       11,490    12,540

Table II shows the precision, recall, F-measure and accuracy resulting from the classification of the test dataset. It reveals modest results for precision, F-measure and accuracy. As the learning graph gets larger, all the measures show better results. We consider the values of precision and recall from the larger word-graph pair to be encouraging, since we intend to pursue experiments using weighted graphs and also considering larger word frames. Experiments with the same dataset show accuracy values of 73.7% for Naïve Bayes [Narayanan et al. 2013] and 88.7% for unigram and bigram mixed modeling classified with logistic regression [Goyal and Parulekar 2015].


Table II. Precision, recall, F-measure and accuracy values (in percentage '%').

                 100 reviews    300 reviews
    Precision    50.00          63.57
    Recall       82.00          89.00
    F-measure    62.12          74.17
    Accuracy     50.00          69.00

The experiments performed, as well as the precision, recall and accuracy values obtained from a large number of reviews (300), demonstrate that the proposed method has the potential of being a solution for SA, and that it might bring classification results similar to lexicon and machine learning methods.

REFERENCES

Aisopos, F., Papadakis, G., Tserpes, K., and Varvarigou, T. Content vs. context for sentiment analysis: a comparative analysis over microblogs. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media. ACM, pp. 187–196, 2012.
Gao, D., Wei, F., Li, W., Liu, X., and Zhou, M. Cross-lingual sentiment lexicon learning with bilingual word graph label propagation. Computational Linguistics, 2015.
Giachanou, A. and Crestani, F. Like it or not: a survey of Twitter sentiment analysis methods. ACM Computing Surveys (CSUR) 49 (2): 28, 2016.
Goyal, A. and Parulekar, A. Sentiment analysis for movie reviews, 2015.
Hatzivassiloglou, V. and McKeown, K. R. Predicting the semantic orientation of adjectives. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 174–181, 1997.
Liu, B. Sentiment Analysis and Opinion Mining: Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, San Rafael 5 (1): 1–167, 2012.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, pp. 142–150, 2011.
Medhat, W., Hassan, A., and Korashy, H. Sentiment analysis algorithms and applications: a survey. Ain Shams Engineering Journal 5 (4): 1093–1113, 2014.
Mihalcea, R. and Radev, D. Graph-based Natural Language Processing and Information Retrieval. Cambridge University Press, 2011.
Narayanan, V., Arora, I., and Bhatia, A. Fast and accurate sentiment classification using an enhanced naive Bayes model. In International Conference on Intelligent Data Engineering and Automated Learning. Springer, pp. 194–201, 2013.
Ravi, K. and Ravi, V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowledge-Based Systems vol. 89, pp. 14–46, 2015.
Scheible, C., Laws, F., Michelbacher, L., and Schütze, H. Sentiment translation through multi-edge graphs. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, pp. 1104–1112, 2010.
Violos, J., Tserpes, K., Psomakelis, E., Psychas, K., and Varvarigou, T. A. Sentiment analysis using word-graphs. In WIMS, pp. 22, 2016.


A novel probabilistic Jaccard distance measure for classification of sparse and uncertain data

I. Martire (1), P. N. da Silva (1), A. Plastino (1), F. Fabris (2), A. A. Freitas (2)

(1) Universidade Federal Fluminense, Brazil. [email protected], {psilva, plastino}@ic.uff.br
(2) School of Computing, University of Kent, Canterbury, Kent, CT2 7NF, UK. {F.Fabris, A.A.Freitas}@kent.ac.uk

Abstract. Classification is one of the most important tasks in the data mining field, allowing patterns to be leveraged from data in order to try to properly classify unseen instances. Also, more and more often, the classification task has to be performed on datasets containing uncertain data. Although an increasing number of studies have been developed to handle uncertainty in classification in the last decade, there are still many underexplored scenarios, such as sparse data, usual in the bioinformatics field. Thus, in this work, we propose a novel distance measure for sparse and uncertain binary data based on the widely used Jaccard distance, testing its performance using the 1NN classifier. We evaluate the classification performance of our proposed method on 28 biological aging-related datasets with sparse and probabilistic binary features and compare it with a common technique to handle uncertainty by employing data transformation and traditional classification. The experimental results show that our proposed distance measure has both a smaller runtime and a better predictive performance than the traditional transformation approach.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.5.4 [Pattern Recognition]: Applications; J.3 [Computer Applications]: Life and Medical Sciences

Keywords: aging-related genes, classification, data mining, Jaccard distance, uncertain data

1. INTRODUCTION

The classification task is one of the most relevant tasks in the data mining field [Han et al. 2011]. Given a dataset of pre-labeled instances, the classification task comprises the induction of a classification model that is capable of predicting the class of an unseen instance based solely on its features. These features can have a numerical or categorical domain, with certain or uncertain values. Naturally, because of their higher prevalence, the majority of the techniques that have been developed so far focus on the handling of certain data [Aggarwal 2014]. In this work, we focus on the classification of uncertain data, specifically in sparse datasets. This kind of data can originate from many sources due to various factors, such as measurement precision limits, measurement errors, approximations or even lack of information. Even though the number of studies on classifying uncertain data has significantly increased in the last decade [Aggarwal 2014], there are still many underexplored areas, as is the case of sparse datasets. We are particularly interested in the study of aging-related genes (represented as instances in our datasets) in order to identify the effect of genes on the longevity of an organism. These datasets commonly use binary features extracted from the Gene Ontology (GO) database [Ashburner et al. 2000], but another important type of feature are protein-protein interactions (PPIs) [Stojanova et al. 2013]. PPI features indicate whether or not an aging-related protein interacts with each of a set of


other proteins (which may or may not be aging-related proteins). For that purpose, we can use the STRING database [Szklarczyk et al. 2014], a popular source of PPI datasets in the bioinformatics literature. Note, however, that instead of providing binary values for the PPIs, the STRING database provides confidence scores for each interaction. This allows the dataset to present more PPI data, but adds uncertainty to it.

When not working with uncertain data, an approach used in the aging literature to classify genes described by binary features is the k-Nearest Neighbors classifier with the Jaccard distance [Wan et al. 2015] [Wan and Freitas 2017]. Since the Jaccard distance is not able to directly handle uncertain binary values, a data transformation procedure would be required to "remove" the uncertainty, which, of course, could cause loss of valuable data. A simple and common transformation is applying a cut-off on the PPI values, so that when the confidence score is above (below) a certain value it is converted to 1 (0). The problem would then lie in how to choose an appropriate cut-off value. However, in the bioinformatics literature there is usually no concern about optimizing this value, and often not even an explanation of the reasons behind its choice.

Thus, the main contribution of this work is to provide an intuitive, fast and accurate method to handle uncertain PPI data in distance-based classification. For that purpose, we propose a novel Jaccard distance measure able to handle uncertain binary features, without requiring any data transformation procedure or parameter optimization. It also allows the algorithm to benefit from the uncertain information available, removing the need to rely on arbitrary cut-off values or to spend much time optimizing them.

The remainder of this article is organized as follows. Section 2 describes the related work. In Section 3, we introduce the novel distance measure for classification in sparse datasets with probabilistic binary features. Section 4 presents the datasets used in this work. Computational results are presented in Section 5. Lastly, in Section 6, we present the conclusions and future research directions.

2. RELATED WORK

The classification of uncertain data has been extensively studied in the last two decades. Many different techniques have been adapted to handle uncertain data, such as Bayesian approaches [Ren et al. 2009], Neural Networks [Ge et al. 2010], Decision Trees [Tsang et al. 2011], k-Nearest Neighbors [Yang et al. 2015] and Support Vector Machines [Yang and Li 2009]. Most of them focus on uncertain numerical features, not specifically on binary features. Notwithstanding, very few uncertain data mining studies focus on sparse datasets, and they are usually related to other tasks, such as Frequent Itemset Mining [Xu et al. 2014].

As mentioned in the previous section, a lot of the research done so far in the bioinformatics field has simply ignored the uncertain information provided by the STRING database about PPIs. This has been done by applying ad-hoc cut-off values such as 0.4 [Shi et al. 2017], 0.7 [Gao et al. 2017], and 0.9 [Lin et al. 2016].

3. A NOVEL PROBABILISTIC JACCARD DISTANCE MEASURE

Distance-based classifiers use the intuitive idea that instances of the same class are more similar among themselves than among instances of other classes [Han et al. 2011]. Similarity (distance) measures, like the Jaccard index (distance), are functions that calculate how similar (distant) two objects are to (from) each other, and thus are the basis of supervised distance-based classification algorithms. Next, we present the definitions of the traditional Jaccard measure (which cannot directly handle uncertainty, since a data transformation is needed to handle it) and of our proposed distance measure (which handles uncertainty directly).


Let sj and sj′ be the sets of binary features with positive value (the least frequent value for each feature) in instances j and j′, respectively. The Jaccard index is defined as in Equation (1). In the special case when both sj and sj′ are empty, the Jaccard index is defined to be equal to 1.

$$Jaccard(s_j, s_{j'}) = \frac{|s_j \cap s_{j'}|}{|s_j \cup s_{j'}|} \qquad (1)$$

And the Jaccard distance between j and j′ is simply defined as:

$$\delta_{Jaccard}(j, j') = 1 - Jaccard(s_j, s_{j'}) \qquad (2)$$

Note that Equation (1), and consequently Equation (2), are limited to scenarios with binary feature values without uncertainty. We then propose an extension of the Jaccard index to take into account the probability pi(sj) of a binary feature i (out of a total of n features in the dataset) belonging to sj, i.e., having positive value in instance j. Equation (3) defines this new similarity coefficient, here called ProbJaccard (Probabilistic Jaccard measure). Again, we define ProbJaccard(sj, sj′) = 1 when the denominator evaluates to zero, which happens when both sets are certainly empty.

$$ProbJaccard(s_j, s_{j'}) = \frac{\sum_{i=1}^{n} p_i(s_j) \times p_i(s_{j'})}{\sum_{i=1}^{n} \left[ p_i(s_j) + p_i(s_{j'}) - p_i(s_j) \times p_i(s_{j'}) \right]} \qquad (3)$$

Like Equation (1), the numerator of Equation (3) measures the degree of intersection between the two instances, while the denominator measures the degree of union between the two instances. Note, however, that these degrees of intersection and union are probabilistic in Equation (3). Analogously, we define the Probabilistic Jaccard distance between j and j′ as:

$$\delta_{ProbJaccard}(j, j') = 1 - ProbJaccard(s_j, s_{j'}) \qquad (4)$$
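As an illustration of Equations (1) to (4), the minimal Python sketch below computes the ProbJaccard distance over a sparse dictionary encoding. The dictionary representation and function names are our own choices for the example, not the authors' implementation.

    def prob_jaccard_distance(p_j, p_jp):
        # p_j and p_jp map feature index i to p_i(s_j), the probability that feature i
        # has the positive value in each instance; missing features have probability 0.
        features = set(p_j) | set(p_jp)
        inter = sum(p_j.get(i, 0.0) * p_jp.get(i, 0.0) for i in features)
        union = sum(p_j.get(i, 0.0) + p_jp.get(i, 0.0)
                    - p_j.get(i, 0.0) * p_jp.get(i, 0.0) for i in features)
        similarity = 1.0 if union == 0 else inter / union  # ProbJaccard = 1 when both sets are certainly empty
        return 1.0 - similarity

    # With certain binary data this reduces to the classical Jaccard distance of Eq. (2):
    print(prob_jaccard_distance({1: 1.0, 2: 1.0}, {2: 1.0, 3: 1.0}))  # 1 - 1/3
    # A single uncertain feature with probability 0.5 in both instances (the case
    # discussed later in Section 5) yields a nonzero distance: 1 - 0.25/0.75.
    print(prob_jaccard_distance({1: 0.5}, {1: 0.5}))  # 0.666...

Restricting the sums to the union of the stored feature indices is equivalent to summing over all n features, since features with probability 0 in both instances contribute nothing to either sum.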

Note that all these indexes and distances take values in the interval [0,1]. Also note that, when working with certain data, Equations (3) and (4) become equivalent to Equations (1) and (2), and, thus, they can be used in datasets with both certain and uncertain binary features.

4. EXPERIMENTAL DATASETS

We use 28 datasets of aging-related genes, where instances are genes and the binary class indicates whether or not the genes are related to longevity. These datasets were created by integrating data from the Human Ageing Genomic Resources (HAGR) GenAge database (version: 335 Build 17) [de Magalhães et al. 2009] and the Gene Ontology (GO) database (version: 2015-10-10) [Ashburner et al. 2000]. HAGR is a database of aging- and longevity-associated genes in model organisms which provides aging information for genes from four model organisms: C. elegans (worm), D. melanogaster (fly), M. musculus (mouse) and S. cerevisiae (yeast). The GO database provides information about three ontology types: biological process (BP), molecular function (MF) and cellular component (CC). Each ontology contains a separate set of GO terms (features). So, for each of the four model organisms, we created seven datasets, with seven combinations of feature types, denoted by BP, CC, MF, BP.CC, BP.MF, CC.MF, and BP.CC.MF.

Hence, each dataset contains instances (genes) from a single model organism. Each instance is formed by a set of binary features indicating whether or not the gene is annotated with each GO


term and a binary class variable indicating whether the instance is positive ("pro-longevity" gene) or negative ("anti-longevity" gene) according to the HAGR database. These GO feature values are highly sparse and, in order to avoid overfitting, GO terms which occurred in fewer than three genes were discarded, avoiding the use of rare features with very little statistical support and virtually no generalization power for our set of genes.

Finally, as a contribution to the aging-related gene classification problem, and in order to improve the predictive performance achieved when using only GO terms [Wan et al. 2015], we added uncertain protein-protein interaction (PPI) data from the STRING database (version: 10) [Szklarczyk et al. 2014] to each of the 28 datasets. This data is also highly sparse and, as we did with the GO features, we filtered out the PPIs that occurred in fewer than three genes. These PPI feature values were obtained, in the STRING database, from the combined_score field in the network.node_node_links table. Their values s ∈ [0, 1] indicate the degree of confidence of the corresponding interactions. We use these values under a probabilistic perspective, where the features can be seen as binary ones (with the value 0 (1) indicating absence (presence) of the corresponding PPI in that instance's set of PPIs) and their values are represented by a probability distribution function f, defined as f(1) = s and f(0) = 1 − s.

Table I shows statistics for each dataset, including information on their sparsity. For each of the four model organisms, each of the seven rows shows information about a specific dataset. The first column identifies the model organism. The second column shows the selected Gene Ontologies of the dataset. The other columns show, respectively, the number of features, the number (and percentage) of GO features, the number of PPI features, the average percentage of GO features with value 0 in an instance, the average percentage of PPI features with value 0 in an instance, the number of instances, the number (and percentage) of positive-class instances and the number of negative-class instances. For example, for the C. elegans dataset with GO terms of the Biological Process (BP) ontology type only (first row), out of the 12,438 features, 991 (7.97%) are GO features and the remaining 11,447 (92.03%) are PPI features. Also, the column "avg. % GO = 0" shows that, on average, an instance of that dataset has 95.48% of its GO features with value 0, and the column "avg. % PPI = 0" shows that, on average, an instance of that dataset has 95.32% of its PPI features with value 0. Finally, the last three columns show that this dataset has 657 instances, of which only 226 (34.40%) are labeled positive (Pos) and the remaining 431 (65.60%) are labeled negative (Neg).

5. EXPERIMENTS

In our datasets, as shown in Table I, the distribution of instances between the two classes is imbalanced. Hence, if the simple accuracy measure (the percentage of correctly classified instances) had been used, it would provide a misleading performance evaluation, since we could trivially obtain a high accuracy (but no useful model) by predicting the majority class for all instances [Japkowicz and Shah 2011]. We therefore evaluate the predictive performance of the classifiers using the Geometric mean (Gmean), defined as Gmean = $\sqrt{Sens \times Spec}$, which takes into account the balance between the classifier's sensitivity (Sens) and specificity (Spec) [Japkowicz and Shah 2011]. Sensitivity (specificity) is the proportion of pro-longevity (anti-longevity) genes that were correctly predicted as pro-longevity (anti-longevity) in the testing dataset [Altman and Bland 1994]. The reported Gmean value for each dataset is the average of the 10 Gmean values generated by the well-known stratified 10-fold cross-validation procedure [Witten et al. 2016]. In this work, we use the 1-Nearest Neighbor (1NN) classifier since, in previous work, it has been shown effective for classification in aging-related datasets [Wan et al. 2015][Wan and Freitas 2017]. We start by testing the improvement in predictive performance when the PPI features are added to the original database composed of GO terms only.
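A minimal sketch of the evaluation measure described above is given below; the stratified 10-fold splitting and the 1NN classification themselves are assumed to happen outside this snippet, and the label name is only illustrative.

    import math

    def gmean(sens, spec):
        # Geometric mean of sensitivity and specificity: sqrt(Sens x Spec).
        return math.sqrt(sens * spec)

    def sens_spec(y_true, y_pred, positive="pro-longevity"):
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        return sens, spec

    # Example: a fold where 80% of pro-longevity and 60% of anti-longevity genes are correct.
    print(gmean(0.8, 0.6))  # ~0.693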


Table I: Statistics for each dataset.

S. cerevisiae

M. musculus

D. melanogaster

C. elegans

Organism

Dataset BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF

# features 12438 11163 11151 12626 12733 11731 12912 7359 6549 6698 7503 7559 6817 7648 11513 10236 10323 11655 11753 10563 11895 6305 5606 5682 6450 6526 5827 6671

# (%) GO features 991 (7.97) 178 (1.59) 263 (2.36) 1169 (9.26) 1254 (9.85) 441 (3.76) 1432 (11.09) 800 (10.87) 89 (1.36) 145 (2.16) 889 (11.85) 945 (12.50) 234 (3.43) 1034 (13.52) 1332 (11.57) 142 (1.39) 240 (2.32) 1474 (12.65) 1572 (13.38) 382 (3.62) 1714 (14.41) 844 (13.39) 145 (2.59) 221 (3.89) 989 (15.33) 1065 (16.32) 366 (6.28) 1210 (18.14)

# PPI features 11447 10985 10888 11457 11479 11290 11480 6559 6460 6553 6614 6614 6583 6614 10181 10094 10083 10181 10181 10181 10181 5461 5461 5461 5461 5461 5461 5461

avg. % GO = 0 95.48 93.35 94.93 95.47 95.65 95.01 95.62 91.68 86.98 92.28 91.38 91.89 90.72 91.56 89.35 83.20 90.27 88.79 89.53 87.93 89.03 94.65 89.96 94.27 93.96 94.57 92.56 94.02

avg. % PPI = 0 95.32 94.63 94.58 95.35 95.35 94.87 95.37 91.11 90.85 90.92 91.20 91.20 91.17 91.20 90.04 90.11 89.86 90.04 90.04 90.04 90.04 92.25 92.25 92.25 92.25 92.25 92.25 92.25

# instances 657 484 504 664 663 566 667 132 122 126 133 133 130 133 109 107 106 109 109 109 109 331 331 331 331 331 331 331

# (%) Pos 226 (34.40) 176 (36.36) 190 (37.70) 228 (34.34) 227 (34.24) 205 (36.22) 229 (34.33) 95 (71.97) 86 (70.49) 89 (70.63) 95 (71.43) 95 (71.43) 92 (70.77) 95 (71.43) 75 (68.81) 73 (68.22) 72 (67.92) 75 (68.81) 75 (68.81) 75 (68.81) 75 (68.81) 44 (13.29) 44 (13.29) 44 (13.29) 44 (13.29) 44 (13.29) 44 (13.29) 44 (13.29)


# Neg 431 308 314 436 436 361 438 37 36 37 38 38 38 38 34 34 34 34 34 34 34 287 287 287 287 287 287 287

Since this added data is uncertain and the Jaccard distance does not handle uncertain values, we decided to use, as a baseline, a 5-fold Internal Cross-Validation (ICV) method (accessing the training set only) to automatically choose a cut-off value to discretize the feature (feature values greater than or equal to the cut-off are set to 1, and set to 0 otherwise). This ICV is performed in each iteration of the external cross-validation procedure. This baseline method is here called Jaccard-ICV. Applying this cut-off on the uncertain data allows us to convert it to certain binary values and then use it with the 1NN classifier and the traditional Jaccard distance. The STRING database online search interface suggests four cut-off values: 0.15, 0.40, 0.70 and 0.90, meaning, respectively, low, medium, high and highest confidence. These values have also been extensively employed in the related literature [Lin et al. 2016] [Shi et al. 2017] [Gao et al. 2017]. For these two reasons, the ICV focused on choosing the best out of these four cut-off values.

The results are shown in Table II, where the boldface numbers denote the highest Gmean value obtained for each dataset. The first two columns are the same as in Table I, explained in the previous section. The third column shows the Gmean values obtained by the 1NN classifier on datasets with GO features only, using the traditional Jaccard distance metric. The fourth and fifth columns show the values obtained with the classification on the datasets composed of both GO and PPI data. While the fourth column shows the results with the internal cross-validation approach explained above, the fifth column shows the results obtained when using 1NN with our new proposed distance measure. Each row represents a different dataset in the same way as in Table I. Table II, however, has two additional rows. The second-to-last row, Average Rank, shows the average rank obtained by each method over the 28 datasets. For each dataset, the best method receives the ranking value of 1; conversely, the worst method receives the ranking value of 3. So, the smaller the average rank of a method, the better its overall predictive performance. Finally, the last row, #Wins, shows the number of datasets where each method obtained the best predictive performance. Again, the boldface numbers denote the best result in each of these two rows.
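The cut-off selection of the Jaccard-ICV baseline, as we read the description above, can be sketched as follows; the evaluate function (the 5-fold internal cross-validation with 1NN and the Jaccard distance on the binarized data) is only assumed here, not shown.

    def binarize(p_features, cutoff):
        # Convert STRING confidence scores to certain binary values at a given cut-off.
        return {i for i, s in p_features.items() if s >= cutoff}

    def choose_cutoff(train_fold, evaluate, cutoffs=(0.15, 0.40, 0.70, 0.90)):
        # Pick the cut-off with the best internal-cross-validation score on the training fold.
        # evaluate(train_fold, cutoff) is assumed to return the average Gmean of the ICV.
        return max(cutoffs, key=lambda c: evaluate(train_fold, c))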


Table II: Comparison of predictive performance using Gmean as evaluation measure.

GO Group

Dataset

S. cerevisiae

M. musculus D. melanogaster

C. elegans

BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF Average Rank # Wins

GO + PPI

Jaccard

Jaccard-ICV

Prob-Jaccard

55.91 59.73 53.47 61.14 58.07 60.33 58.11 64.17 70.44 50.65 61.87 62.88 58.69 62.57 62.98 50.74 53.94 61.84 63.81 56.61 62.27 53.69 50.61 40.34 58.32 51.03 41.56 53.60 2.71 2

65.30 61.01 64.86 65.25 67.15 63.12 68.49 52.39 72.08 60.52 55.05 63.24 64.14 65.49 68.31 56.27 65.64 55.56 66.29 67.23 63.49 57.34 53.56 58.69 55.88 57.83 63.74 57.32 1.75 11

64.13 63.21 66.41 66.44 65.19 62.30 66.25 61.13 68.03 58.13 65.19 63.81 64.93 63.30 63.07 63.95 69.18 56.81 65.30 68.89 58.51 58.26 61.45 58.99 65.39 58.29 60.73 62.88 1.54 15

The results in Table II show that Prob-Jaccard, which uses our proposed distance measure, achieves the best predictive performance on 15 datasets, followed by Jaccard-ICV (best results on 11 datasets) and Jaccard (2 datasets). To determine whether the differences in performance are statistically significant, we ran the non-parametric Friedman test followed by the Nemenyi test [Japkowicz and Shah 2011]. Both tests were used at the 0.05 significance level. The Friedman test indicated that there was at least one pair of classifiers with a statistical difference in predictive performance. Hence, we employed the post-hoc Nemenyi test to discover in which pairs this difference occurs. The Nemenyi test showed that both Prob-Jaccard and Jaccard-ICV are significantly superior to Jaccard-GO, which does not include PPI features. However, even though Prob-Jaccard achieves both a better average rank and a higher number of wins than Jaccard-ICV, the difference in performance between Prob-Jaccard and Jaccard-ICV was not statistically significant.

One could think of using the Euclidean distance with the 1NN classifier by using the probability values as feature values, thus leading to a scenario with "certain" numerical features instead of uncertain binary ones. A preliminary experiment using this strategy was performed, obtaining very poor results when compared to the other two methods explored in this article. These results are somewhat intuitive, since the Euclidean distance is known to be weakly discriminant for multidimensional and sparse data, and also because treating a probability as just a numeric value can lead to wrong assumptions. As an example, consider the distance between two instances with a single uncertain binary feature, and assume this feature's values for both instances are represented by the same probability distribution function f, for which f(0) = f(1) = 0.5. The Euclidean distance between these two instances would be zero, even though, if we assume that the (unknown) true value of a feature is binary (an assumption that may or may not be appropriate depending on the application domain), there is a 50% chance that these two instances have opposite binary values for their single feature.

86

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

A novel probabilistic Jaccard distance measure for classification of sparse and uncertain data

·

7

Based on the conducted experiments, we can notice a great improvement by simply adding the PPI features and optimizing the choice of cut-off value for each fold via internal cross-validation. However, this approach is slow, which could be a big problem when working with larger datasets. We then compared the runtime performance of the Jaccard-ICV method with our proposed ProbJaccard method to demonstrate how much faster this proposed method can be in comparison to the internal cross-validation one, without losing in overall predictive performance (and actually improving it most of the times). These results are presented in Table III. In this table, the first two columns are exactly the same as the ones in the previous tables. The third and fourth columns show the average time in seconds that was taken to classify a fold in the 10-fold cross-validation procedure. Notice that the reported times for the Jaccard-ICV method include the time spent in the selection of the cut-off value. The last column shows the ratio of the values in the third column to the values in the fourth column, which indicates how many times faster the Prob-Jaccard method is in comparison to Jaccard-ICV. These times were measured in a computer with 1.6 GHz Intel Core i5 processor and 4 GB 1600 MHz DDR3 memory. Table III: Comparison of average CPU time in seconds per cross-validation fold.

S. cerevisiae

M. musculus D. melanogaster

C. elegans

Group

Dataset BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF BP CC MF BP.CC BP.MF CC.MF BP.CC.MF Average

GO + PPI Jaccard-ICV Prob-Jaccard 18.048 0.835 9.577 0.399 10.802 0.438 20.492 0.845 20.316 0.861 13.394 0.594 21.181 0.855 0.639 0.023 0.492 0.018 0.469 0.016 0.705 0.020 0.676 0.021 0.527 0.020 0.718 0.022 0.639 0.023 0.492 0.018 0.469 0.016 0.705 0.020 0.676 0.021 0.527 0.020 0.718 0.022 3.796 0.112 3.321 0.097 3.274 0.105 4.008 0.116 3.925 0.118 3.488 0.108 4.190 0.126 5.314 0.233

Jaccard-ICV Prob-Jaccard 21.614 24.003 24.662 24.251 23.596 22.549 24.773 28.684 24.200 25.059 27.500 29.905 25.667 30.700 27.783 27.333 29.313 35.250 32.190 26.350 32.636 33.893 34.237 31.181 34.552 33.263 32.296 33.254 28.596

The last row of Table III shows that, on average, the Jaccard-ICV approach took 5.3 seconds to classify a single fold, while Prob-Jaccard took only 0.2 seconds. The last column in that row shows that the Prob-Jaccard approach was able to classify a single fold 28.6 times faster on average. 6. CONCLUSIONS In this work, we presented a novel Jaccard distance measure for nearest-neighbor classification in sparse datasets with probabilistic binary features. We compared both the speed and the predictive performance of the 1NN classifier using both our novel distance measure and the traditional Jaccard distance (by applying an internal cross-validation to optimize the cut-off value). Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

87

5th KDMiLe – Proceedings

8

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

I. Martire and P. N. da Silva and A. Plastino and F. Fabris and A. A. Freitas

The 1NN classifier using the proposed ProbJaccard distance measure is significantly faster than the Jaccard-ICV method. This is due to the fact that ProbJaccard handles the uncertainty from the data directly, so there is no need to perform an internal cross-validation to optimize a cut-off parameter. Additionally, the proposed ProbJaccard method has shown an overall improvement in the predictive performance of the 1NN classifier across 28 aging-related datasets, with a better average rank and higher number of wins when compared with the Jaccard-ICV method and a dataset with GO terms only, as shown in Table II; even though there was no statistically significant difference between the results of ProbJaccard and Jaccard-ICV. Finally, this new distance measure can be extended to handle categorical features with more general types of uncertain values in sparse classification datasets. We leave this research for future work. REFERENCES Aggarwal, C. C. Data classification: algorithms and applications. CRC Press, 2014. Altman, D. G. and Bland, J. M. Diagnostic tests 1: sensitivity and specificity. British Medical Journal 308 (6943): 1552, 1994. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., et al. Gene Ontology: tool for the unification of biology. Nature genetics 25 (1): 25–29, 2000. de Magalhães, J. P., Budovsky, A., Lehmann, G., Costa, J., Li, Y., Fraifeld, V., and Church, G. M. The Human Ageing Genomic Resources: online databases and tools for biogerontologists. Aging cell 8 (1): 65–72, 2009. Gao, Y., Xu, D., Zhao, L., and Sun, Y. The DNA damage response of C. elegans affected by gravity sensing and radiosensitivity during the Shenzhou-8 spaceflight. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 795 (1): 15–26, 2017. Ge, J., Xia, Y., and Nadungodage, C. UNN: a neural network for uncertain data classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Hyderabad, India, pp. 449–460, 2010. Han, J., Pei, J., and Kamber, M. Data mining: concepts and techniques. Morgan Kaufmann, 2011. Japkowicz, N. and Shah, M. Evaluating learning algorithms: a classification perspective. Cambridge University Press, 2011. Lin, D., Zhang, J., Li, J., Xu, C., Deng, H.-W., and Wang, Y.-P. An integrative imputation method based on multi-omics datasets. BMC Bioinformatics 17 (1): 247, 2016. Ren, J., Lee, S. D., Chen, X., Kao, B., Cheng, R., and Cheung, D. Naive Bayes Classification of Uncertain Data. In IEEE International Conference on Data Mining. Miami, United States of America, pp. 944–949, 2009. Shi, J., Zhang, Y., Qi, S., Liu, G., Dong, X., Huang, N., Li, W., Chen, H., and Zhu, B. Identification of potential crucial gene network related to seasonal allergic rhinitis using microarray data. European Archives of Oto-Rhino-Laryngology 274 (1): 231–237, 2017. Stojanova, D., Ceci, M., Malerba, D., and Dzeroski, S. Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction. BMC Bioinformatics 14 (1): 285, 2013. Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., et al. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research 43 (D1): D447–D452, 2014. Tsang, S., Kao, B., Yip, K. Y., Ho, W.-S., and Lee, S. D. Decision trees for uncertain data. 
IEEE transactions on knowledge and data engineering 23 (1): 64–78, 2011. Wan, C. and Freitas, A. A. An empirical evaluation of hierarchical feature selection methods for classification in bioinformatics datasets with gene ontology-based features. Artificial Intelligence Review , 2017. Wan, C., Freitas, A. A., and de Magalhães, J. P. Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 (2): 262–275, 2015. Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016. Xu, J., Li, N., Mao, X.-J., and Yang, Y.-B. Efficient probabilistic frequent itemset mining in big sparse uncertain data. In Pacific Rim International Conference on Artificial Intelligence. Gold Coast, Australia, pp. 235–247, 2014. Yang, J.-L. and Li, H.-X. A probabilistic support vector machine for uncertain data. In IEEE International Conference on Computational Intelligence for Measurement Systems and Applications. Hong Kong, China, pp. 163– 168, 2009. Yang, L., Chen, H., Cui, Q., Fu, X., and Zhang, Y. Probabilistic-KNN: A novel algorithm for passive indoorlocalization scenario. In IEEE Vehicular Technology Conference. Glasgow, United Kingdom, pp. 1–5, 2015.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

88

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Improving Activity Recognition using Temporal Regions João Paulo Aires1 and Juarez Monteiro1 and Roger Granada1 and Felipe Meneguzzi2 and Rodrigo C. Barros2

1

Faculdade de Informática Pontifícia Universidade Católica do Rio Grande do Sul Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil Email: {joao.aires.001, juarez.santos, roger.granada}@acad.pucrs.br 2 Email: {rodrigo.barros, felipe.meneguzzi}@pucrs.br

Abstract. Recognizing activities in videos is an important task in computer vision area. Automatizing such task can improve the way we monitor activities since it is not necessary to have a human to watch every video. However, the classification of activities in a video is a challenging task since we need to extract temporal features that represent each activity. In this work, we propose an approach to obtain temporal features from videos by dividing the sequence of frames of a video into regions. Frames from these regions are merged in order to identify the temporal aspect that classify activities in a video. Our approach yields better results when compared to a frame-by-frame classification. Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding; I.5.4 [Pattern Recognition]: Applications Keywords: Neural Networks, Convolutional Neural Networks, Activity Recognition

1.

INTRODUCTION

Recognizing activities in videos is an important task for humans, since it helps the identification of different types of interactions with other agents. An important application of such task refers to the surveillance and assistance of the sick and disabled: such tasks require human monitors to identify any unexpected activity. Here, activity recognition refers to the task of dealing with noisy low-level data directly from sensors [Sukthankar et al. 2014]. The automation of such task is particularly challenging in the real physical world, since it either involves fusing information from a number of sensors or inferring enough information using a single sensor. To perform such task, we need an approach that is able to process the frames of a video and extract enough information in order to determine the activity. Due to the evolution of graphics processing unit (GPU) and an increasing in computability power, deep learning methods started to be the state-of-the-art of different computer vision tasks [Karpathy et al. 2014; Chen et al. 2017; Redmon et al. 2016]. However, for activity recognition, deep learning algorithms need to consider the temporal aspect of videos since activities tend to occur through the frames. Besides that, this kind of algorithms demands a big volume of data in order to generalize when classifying unseen data. In this work, we propose an approach to compute the temporal aspect from frames by dividing a video into regions and merging features from different regions in order to recognize an activity. Using Convolutional Neural Networks (CNNs) to extract features from each frame, we train a classifier that receives as input either a concatenation or the mean of features from different regions and outputs

c Copyright 2017 Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

89

5th KDMiLe – Proceedings

2

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

J.P. Aires, J. Monteiro, R. Granada, F. Meneguzzi, R.C. Barros

a prediction to the corresponding activity. Considering videos as a division of regions, we unify different moments of the video in a single instance, which makes it easier to classify. We perform experiments using two off-the-shelf CNNs in order to verify whether the regions work independently of the architecture. Our results indicate that the use of regions may increase the accuracy about 10% when compared with single networks that classify activities frame-by-frame. 2.

METHOD

Recognizing activities from videos often involves an analysis of how objects in frames modify along the time. An approach to recognize activities in videos involves obtaining the temporal information, which may contain descriptions about how objects are moving and the main modifications between frames. Since an activity consists of a sequence of movements, obtaining the temporal information may help systems to better understand which activity is being performed. In this work, we propose an approach to obtain temporal information from a video by dividing its frames into regions. Thus, instead of classifying an activity using only the information from each image frame, we extract and merge the information from several regions of the video in order to obtain its temporal aspect. Figure 1 illustrates the pipeline of our architecture, which is divided into four phases: pre-processing, CNNs, regions division, and classification.

Fig. 1.

Pipeline of our architecture for activity recognition using multiple regions.

1) Pre-processing: This phase consists of resizing images to a fixed resolution of 256 × 256. Resizing is important since it reduces the number of features for each image inside the CNN as well as the processing time. 2) CNNs: In this phase we train two CNNs using the DogCentric Activity dataset [Iwashita et al. 2014] in order to extract features of the activity in each frame. We first train a GoogLeNet [Szegedy et al. 2015] due to its reduced number of parameters generated by inception modules. GoogLeNet is a 22-layer deep network and is based on the inception module that contains convolutional filters with different sizes, covering different clusters of information. The second CNN is a slightly modified version of AlexNet [Krizhevsky et al. 2012] called CaffeNet [Jia et al. 2014] due to its small architecture, which is able to provide a fast training phase. Our version of the AlexNet contains 8 weight layers including 5 convolutional layers and 3 fully-connected layers, and 3 max-pooling layers following the first, second and fifth convolutional layers. Both networks receive a sequence of images as input, Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

90

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Improving Activity Recognition using Temporal Regions

·

3

passing them through several convolutional layers, pooling layers and fully-connected layers (FC), ending in a softmax layer that generates the probability of each image to each class. We use CNNs as feature extractors for images of the dataset in order to generate a feature representation of each frame. 3) Regions division: In this phase, we divide each sequence of frames of a video into n regions of the same size, i.e., containing the same number of frames, discarding the remaining ones in case they exist. To make a composition of different parts of the video, we take one frame of each region and either concatenate or take the mean of their features. For example, consider a video divided into three regions and each frame containing ten features, the resulting vector of a concatenation will contain thirty features, while the resulting vector of the mean will contain ten features. In order to avoid repetitions between the regions representation, we merge the features of the frame in the ith position of the first region with the feature of the frame in the ith + 1 position in the other regions, e.g., (a1 · b2 · c2 ) in “concatenation” in Figure 1. In case where the features of the frame are in the last position of the first region, the features of the frames in the next regions are extracted from the first position, e.g., (a4 · b1 · c1 ) in “concatenation” in Figure 1. 4) Classification: In this phase, we apply a Support Vector Machine (SVM) [Crammer and Singer 2001] on the features from the concatenation or mean phase in order to predict the activity. 3.

EXPERIMENTS

In this section, we describe the dataset used in our experiments for animal activity recognition and the implementation details used in the CNN models and SVM algorithm. 3.1 Dataset The DogCentric Activity dataset1 [Iwashita et al. 2014] consists of 209 videos containing 10 different activities performed by 4 dogs. The dataset contains first-person videos in 320 × 240 resolution (recorded at 48 frames per second) taken from an egocentric animal viewpoint, i.e., a wearable camera mounted on dogs’ back records outdoor and indoor scenes. Not all dogs perform all activities and an activity can be performed more than once by the same dog. The 10 target activities in the dataset include: “waiting for a car to pass by” (hereafter named Car ), “drinking water” (Drink ), “feeding” (Feed ), “turning dog’s head to the left” (Left), “turning dog’s head to the right” (Right), “petting” (Pet), “playing with a ball” (Play), “shaking dog’s body by himself” (Shake), “sniffing” (Sniff ), and “walking” (Walk ). It is important to mention that these egocentric (first-person) videos are very challenging due to their strong camera motion. In the left side of Figure 2, we show the dataset distribution according to its classes. As we can see, classes such as ‘Left’ and ‘Right’ have less than 10% of presence in the dataset, whereas ‘Car’ and ‘Sniff’ classes have 15% indicating that the dataset is unbalanced. In the right side of Figure 2, we illustrate examples for each class in the dataset. In order to perform experiments, we divided the dataset into training, validation, and test sets. We use the validation set to obtain the model configuration that best fits the training data, i.e., the configuration with the highest accuracy, and the test set to assess the accuracy of the selected model in unseen data. We use a method to divide the dataset that is similar to the one proposed by Iwashita et al. [2014], and consists of randomly selecting half of the videos of each activity as test set. In case where the number of videos (N ) is an odd number, we separate (N2+1) videos for test set. We divide the rest of the videos into training and validation sets. Validation set contains 20% of the videos and the rest is separated to training set. Thus, our division contains 105 videos (17,400 frames) in testing set, 20 videos (3,205 frames) in validation set and 84 videos (13,040 frames) in training set. 1 http://robotics.ait.kyushu-u.ac.jp/~yumi/db/first_dog.html

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

91

5th KDMiLe – Proceedings

4

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

J.P. Aires, J. Monteiro, R. Granada, F. Meneguzzi, R.C. Barros

Fig. 2.

Frames from each class of the dataset and the distribution of frames in classes.

3.2 Model Training We implemented each part of our architecture using different toolkits: for CNNs, we use the Caffe 2 framework [Jia et al. 2014], and for the SVM, we use the Crammer and Singer [2001] implementation from scikit-learn 3 [Pedregosa et al. 2011]. Each CNN is trained by fine-tuning a network pre-trained in ImageNet [Russakovsky et al. 2015]. In training phase, both networks (GoogLeNet and AlexNet) use mini-batch stochastic gradient with momentum (0.9). For each image, we apply a random crop, i.e., a crop in a random part of the image, generating a sub-image of 224 × 224. We subtract all pixels from each image by the mean pixel of all training images. The activation of each convolution (including those within the “inception” modules in GoogLeNet) use a Rectified Linear Unit (ReLU). Each iteration in GoogLeNet uses a mini-batch with 128 images. For the weight initialization, we employ the Xavier [Glorot and Bengio 2010] algorithm, which automatically determines the value of initialization based on the number of input neurons. To avoid overfitting, we apply a dropout with a probability of 80% on the fully-connected layers. The learning rate is set to 3 × 10−4 and we drop it to 2 × 10−4 every epoch, stopping the training after 2.04k iterations (20 epochs). In AlexNet, each iteration contains a mini-batch with 64 images. We initialize the weights using the Gaussian distribution with a standard deviation of 0.01. Similar to GoogLeNet, we avoid overfitting by applying a dropout with 90% of probability to prone nodes of fully-connected layers. The learning rate is set to 10−4 and we drop it 5 × 10−4 every epoch, stopping the training after 4.08k iterations (20 epochs). 4.

RESULTS

In order to evaluate our approach, we calculate the accuracy (A), precision (P), recall (R) and Fmeasure (F). Since classification accuracy takes into account only the proportion of correct results that a classifier achieves, it is not suitable for unbalanced datasets because it may be biased towards classes with larger number of examples. Although other factors may change results, classes with a larger number of examples tend to achieve better results since the network has more examples to learn the variability of the features. By analyzing the DogCentric dataset, we notice that it is indeed unbalanced, i.e., classes are not equally distributed over frames. Thus, we decided to calculate 2 http://caffe.berkeleyvision.org/ 3 http://scikit-learn.org/

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

92

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

·

Improving Activity Recognition using Temporal Regions

5

precision, recall, and f-measure to make a better evaluation over the unbalanced dataset. We calculate scores considering all classes presented in the test set as explained in Section 3.1. Table I shows the results achieved by AlexNet and GoogLeNet in our experiments when splitting the dataset into regions. The best results for concatenation and mean are in bold. It is important to note that Regions=1 is equivalent to compute single frames individually without splitting the video, i.e., the traditional approach that do not encode the temporal aspect of an activity. 1) Overall Performance: As presented in Table I, when splitting the video into temporal regions we increase the accuracy of the network. For AlexNet, we achieve the best accuracy in two situations: (a) by using 16 regions and the mean of the feature vectors (59% of accuracy); and (b) using 32 regions with the concatenation of the vectors (59% accuracy). In terms of precision, recall, and fmeasure, AlexNet achieves the best results when using 32 regions and the concatenation of vectors. For GoogLeNet, we obtain the best accuracy for both 8 and 16 regions, achieving 62% of accuracy using concatenated vectors, while the highest precision is achieved by the concatenation of vectors using 16 regions (72% of precision). In general, AlexNet achieves the best scores when increasing the number of regions to 32, while GoogLeNet achieves the highest scores using 8 regions. Overall, GoogLeNet obtains better results for all metrics. This occurs due to the set of layers in GoogLeNet, such as inception that allows the extraction of more complex features from data. 2) Mean vs. Concatenation: In general, the concatenation and the mean of the regions achieve close results for most regions. However, when increasing the number of regions the concatenation surpasses the mean and achieves the best results. On the other hand, the concatenation has the drawback that the more the number of regions, the more memory the system needs, while the mean keeps the memory constant. 3) Single vs. Multiple Regions: When comparing the architectures using single frames or multiple regions, we can observe that even when using two regions, the performance increase in both AlexNet and GoogLeNet. Although some activities can be identified in a single frame, it usually takes several frames to identify an activity. Thus, it seems reasonable that using more than a single frame will lead to better results. For a better understanding of how networks are classifying each class using a single frame and a set of regions, we illustrate in Figure 3 the confusion matrices generated for single frames and for the concatenation of 32 regions (the best results) for AlexNet, and the confusion matrices generated by single frames and the concatenation of 8 regions for GoogLeNet. As we can see in Figure 3, there is an increase in precision for bot Left and Shake classes when splitting the dataset into regions to predict the class of the video using AlexNet. Comparing the results from GoogLeNet, there is an increase in precision for both Right and Shake classes when splitting the dataset into eight regions. Other classes are also positively affected by the region division as it works as a reinforcement to the network classification. Regarding the number of regions, we notice that our results start to obtain significant increase with 8 regions or more. There is a difference between GoogLeNet and AlexNet

Table I. Accuracy (A), Precision (P), Recall (R) and F-measure (F ) achieved for AlexNet and GoogLeNet when using regions in the DogCentric Activity dataset. Regions

Fusion

A

1 2

– Concat Mean Concat Mean Concat Mean Concat Mean Concat Mean

0.51 0.55 0.55 0.56 0.57 0.56 0.58 0.57 0.59 0.59 0.58

4 8 16 32

AlexNet P R 0.57 0.59 0.59 0.61 0.61 0.61 0.62 0.62 0.63 0.66 0.63

0.52 0.55 0.55 0.57 0.57 0.57 0.58 0.57 0.59 0.59 0.59

F

A

0.52 0.55 0.55 0.56 0.57 0.56 0.57 0.57 0.58 0.59 0.58

0.56 0.58 0.58 0.60 0.60 0.62 0.59 0.62 0.58 0.60 0.58

GoogLeNet P R 0.62 0.64 0.64 0.67 0.67 0.71 0.67 0.72 0.67 0.71 0.69

0.56 0.58 0.58 0.60 0.60 0.63 0.59 0.62 0.59 0.61 0.58

F 0.56 0.58 0.58 0.60 0.60 0.63 0.59 0.61 0.59 0.60 0.60

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

93

5th KDMiLe – Proceedings

·

6

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

J.P. Aires, J. Monteiro, R. Granada, F. Meneguzzi, R.C. Barros

number of regions that yield their best results. While GoogLeNet obtains its best results with 8 and 16 regions, AlexNet obtains with 32 regions. We attribute this to the fact that AlexNet needs more features (regions) from different parts of the video to obtain good results, while GoogLeNet does not need that much since it has a more complex structure.

0.0

0.1 Fig. 3.

0.2

0.3

0.4

0.5

0.6

0.7

GoogLeNet: 8 Regions

C Dr ar in Fe k e Ld Ri eft gh t Pe Pla t Sh y ak Sn e W iff alk

GoogLeNet: 1 Region

C Dr ar in Fe k e Ld Ri eft gh t Pe Pla t Sh y ak Sn e W iff alk

AlexNet: 32 Regions

C Dr ar in Fe k e Ld Ri eft gh t Pe Pla t Sh y ak Sn e W iff alk

AlexNet: 1 Region

C Dr ar in Fe k e Ld Ri eft gh t Pe Pla t Sh y ak Sn e W iff alk

Car Drink Feed Left Right Pet Play Shake Sniff Walk

0.8

0.9

Confusion matrix for AlexNet using 1 and 32 regions and GoogLeNet using 1 and 8 regions.

From all the confusion matrices, we can observe that the classes Car and Sniff have the highest scores in classification. The highest scores indicate that the classes are very different of the other classes and the features can be well separated from others. In fact, the class Car, for example, is the only class that contains a car passing by the dog. Besides, both classes have most number of samples in the dataset, which facilitates the network to learn their features. 4) Unbalanced classes: Analyzing the dataset we observe that it is indeed unbalanced, with the classes Car and Sniff having the largest number of frames (each containing ≈ 15% of the frames of the dataset). On the other hand, Left class is only ≈ 4% of the total frames in the dataset and Right class is ≈ 6% of the total frames. This unbalanced nature of the dataset seems to impact the results, since the lower the number of frames the class has, the lower the accuracy score it achieves. 5.

RELATED WORK

The increasing availability of wearable devices such as cameras, Google Glass4 , Microsoft SenseCam5 , etc. facilitates the low-cost acquisition of rich activity data. In recent years, this increase of available egocentric data have attracted a lot of attention in the computer vision community. The first-person point-of-view (or egocentric) is very popular in the study of daily activities [Ryoo et al. 2015; Hsieh et al. 2016], and many work have been proposed. For example, Ma et al. [2016] propose a method that uses twin-stream networks, one to capture the appearance of the objects such as hand regions and objects attributes, and the other to capture motion such as local hand movements and global head motion using stacked optical flow fields as input. A late fusion method joins both networks. Their experiments use GTEA and Gaze [Fathi et al. 2012] egocentric datasets. Using first-person images and Inertial Measurement Units (IMU) sensors, Spriggs et al. [2009] propose an approach to make action segmentation and classification from brownie recipe videos. They use the Carnegie Mellon University Multimodal Activity dataset and manually annotate 29 actions in the videos. In order to segment actions, the authors use the data from IMU sensors as input for unsupervised algorithms that divide the video according to the actions. To classify actions, they test two supervised algorithms: Hidden Markov Models and K-Nearest Neighbours. 4 https://www.google.com/glass/start/ 5 http://research.microsoft.com/en-us/um/cambridge/projects/sensecam/

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

94

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Improving Activity Recognition using Temporal Regions

·

7

Pirsiavash and Ramanan [2012] propose a two-step approach to identify actions from first-person videos. First, they identify the objects in the frames and classify those been used by the person as active. Then, they use the information from the objects to classify the activity been performed. The authors use a manually annotated dataset containing 1 million frames with activities of daily living recorded from an “egocentric” point of view. They annotated 18 action classes and 42 identified objects. To identify activities, they use models considering the presence of objects and when they are active. Ryoo and Matthies [2013] introduce an approach to recognize interaction-level human activities from an egocentric point of view. First, they extract features from first-person videos, such as global and local motion descriptors. Then, they cluster similar descriptors and create a multi-channel kernels to compute video similarities. Using such representations, they train an SVM to classify the videos according to the correct activity. Iwashita et al. [2014] focus on recognizing activities not only using wearable cameras attached to a person but also to animals, from DogCentric Activity dataset. They present several approaches to extract features, such as dense optical flow, binary pattern, cuboid feature detector and STIP detector. Using these hand-crafted features combined with a SVM they achieve the highest classification accuracy of 60.5%. Unlike Iwashita et al. , we do not use hand-crafted features to identify the activity, but instead, we train a end-to-end CNN. Ryoo et al. [2015] develop a new feature representation named pooled time series (PoT), which intends to capture motion information in first-person videos by applying time series pooling of feature descriptors, detecting short-term/long-term changes in each descriptor element. Activity recognition is performed by training and testing a Support Vector Machine (SVM) classifier using these features, achieving 74% of accuracy when combined with INRIA’s improved trajectory feature (ITF) [Wang and Schmid 2013] in the DogCentric Activity dataset. 6.

CONCLUSIONS AND FUTURE WORK

In this work, we use features from single frames generated by a CNN to create temporal regions. Using such regions, we induce the temporal aspect of videos in order to recognize actions. Our model architecture is composed by a CNN (AlexNet or GoogLeNet), a method to separate and compute regions, and a classifier (SVM). Using images from the DogCentric Activity dataset, we perform experiments showing that our approach can improve the activity recognition task. We test our approach using two networks AlexNet and GoogLeNet, increasing up to 10% of Precision when using regions to classify activities. For future work, we plan to test our approach in other datasets to verify its performance in different domains, as well as use other CNN architectures to ensure that our temporal regions can improve other models as well. Finally, we intend to compare our approach with other methods that consider the temporal aspect of the video, such as Long-Short Term Memory (LSTM) [Hochreiter and Schmidhuber 1997] and 3D CNN [Ji et al. 2013].

ACKNOWLEDGMENT

This paper was achieved in cooperation with HP Brasil Indústria e Comércio de Equipamentos Eletrônicos LTDA. using incentives of Brazilian Informatics Law (Law no 8.2.48 of 1991).

REFERENCES Chen, H., Chen, J., Hu, R., Chen, C., and Wang, Z. Action recognition with temporal scale-invariant deep learning framework. China Communications 14 (2): 163–172, 2017. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

95

5th KDMiLe – Proceedings

8

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

J.P. Aires, J. Monteiro, R. Granada, F. Meneguzzi, R.C. Barros

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research vol. 2, pp. 265–292, 2001. Fathi, A., Li, Y., and Rehg, J. M. Learning to recognize daily actions using gaze. In European Conference on Computer Vision. Springer-Verlag, Berlin, Heidelberg, pp. 314–327, 2012. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS’10. Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy, pp. 249–256, 2010. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation 9 (8): 1735–1780, 1997. Hsieh, P.-J., Lin, Y.-L., Chen, Y.-H., and Hsu, W. Egocentric activity recognition by leveraging multiple mid-level representations. In 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, Seattle, WA, USA, pp. 1–6, 2016. Iwashita, Y., Takamine, A., Kurazume, R., and Ryoo, M. S. First-person animal activity recognition from egocentric videos. In 2014 22nd International Conference on Pattern Recognition (ICPR). IEEE, Stockholm, Sweden, pp. 4310–4315, 2014. Ji, S., Xu, W., Yang, M., and Yu, K. 3d convolutional neural networks for human action recognition. TPAMI 35 (1): 221–231, 2013. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In 22nd ACM international conference on Multimedia. ACM, New York, NY, USA, pp. 675–678, 2014. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. CVPR ’14. IEEE Computer Society, Washington, DC, USA, pp. 1725–1732, 2014. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS’12. Curran Associates Inc., USA, pp. 1097–1105, 2012. Ma, M., Fan, H., and Kitani, K. M. Going deeper into first-person activity recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, Nevada, USA, 2016. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research vol. 12, pp. 2825–2830, 2011. Pirsiavash, H. and Ramanan, D. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Providence, RI, USA, pp. 2847–2854, 2012. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, Nevada, USA, pp. 779–788, 2016. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (3): 211–252, 2015. Ryoo, M. S. and Matthies, L. First-person activity recognition: What are they doing to me? In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. Ryoo, M. S., Rothrock, B., and Matthies, L. 
Pooled motion features for first-person videos. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Boston, MA, USA, pp. 896–904, 2015. Spriggs, E. H., Torre, F. D. L., and Hebert, M. Temporal segmentation and activity classification from first-person sensing. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, Miami, FL, USA, pp. 17–24, 2009. Sukthankar, G., , Geib, C., , Bui, H. H., , Pynadath, D. V., , and Goldman, R. P. Plan, Activity, and Intent Recognition. Morgan Kaufmann, Boston, 2014. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Boston, MA, USA, 2015. Wang, H. and Schmid, C. Action recognition with improved trajectories. In 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, Sydney, NSW, Australia, pp. 3551–3558, 2013.

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

96

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Label Powerset for Multi-label Data Streams Classification with Concept Drift J. D. Costa Junior1 , E. R. Faria2 , J. A. Silva3 , R. Cerri1 1

Department of Computer Sciences - Federal University of São Carlos, Brazil joel.costa,[email protected] 2 School of Computer Science - Federal University of Uberlândia, Brazil [email protected] 3 Federal University of Mato Grosso do Sul, Brazil [email protected]

Abstract. Data Streams are unlimited data sequences, continuously generated from non-stationary distribution, arriving at high speed. In several applications, Data Stream examples can be associated with a set of labels. When it is necessary to classify such data, the task is called multi-label data stream classification. Learning in evolving streaming scenarios is challenging, as classifiers must be able to deal with a huge number of examples, and to adapt themselves to changes using limited time and memory, while performing predictions at any point. Considering these challenges, this work proposes a method for Multi-Label Classification in Data Stream scenarios, able to adapt itself to detect Concept Drifts. We extend a method for Data Stream Classification called MINAS. Since MINAS does not deal with multi-label data, we adapt it using a problem transformation method called Label Powerset. The experiments showed that our method presented promising results compared to the literature. Using artificial data, we show that our method is capable of adapting to concept drifts achieving good results. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications—Data Mining; I.2.6 [Artificial Intelligence]: Learning Keywords: Concept drift, Data stream, Label powerset, Multi-label classification

1.

INTRODUCTION

Data Streams (DS) are unlimited data sequences, continuously generated from non-stationary distribution, arriving at high speed [Aggarwal 2007]. Also, these data can not be stored in memory, which requires an example to be processed only once. Another important characteristic in DS is Concept Drift, where the concept about which data is being collected may shift from time to time [Gama 2010]. Classification is a Machine Learning task consisting of finding common properties of the data set examples, and categorizing them into different classes [Fan and Li 1998]. To build DS classifiers, the following restrictions should be considered [Nguyen et al. 2015; Gama 2010]: —Real-Time Analysis: the decision model must be quick and accurate to make decisions; —One-shot learning: the examples should be analyzed and then discarded; —Dynamic models: decision models have to evolve in correspondence with the dynamic environment; —Finite Computational resources: algorithms have to use limited computational resources (in terms of computations, memory, space and time). Only a summary of the data must be stored in memory. The authors would like to thank CAPES and FAPESP for their financial support, specially the grant #2015/14300-1 São Paulo Research Foundation (FAPESP). c Copyright 2017 Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

97

5th KDMiLe – Proceedings

2

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Costa Junior et. al.

Fig. 1.

Illustration of the LP transformation. Adapted from [Tsoumakas et al. 2009]

In Multi-Label (ML) Classification [Tsoumakas et al. 2009] tasks, examples can be associated with a set of labels. Consider the task of document categorization. Each document can simultaneously belong to more than one topic or label: a document can be classified as belonging to Computer Science, Physics and Application, another document can be assigned to the areas of Biology and Theory, and a third can be related to Mathematics and Physics. This problem would then have at least six classes or labels (Computer Science, Physics, Application, Biology, Theory and Mathematics) [de Carvalho and Freitas 2009]. A traditional approach widely used in multi-label learning is the Label Powerset (LP) transformation (or Label Combination) [Tsoumakas et al. 2009], which consists of transforming a multi-label problem into a single-label multi-class problem. In the transformed problem, each combination of labels presented in the original dataset is transformed into a single label. Despite the disadvantage of being the worst-case computational complexity (involving 2L classes in the transformed multi-class problem), the LP is simple, consider label correlations and after transformation any multi-class algorithm can be used for classification. Figure 1 illustrates an example of the LP strategy. Although classification tasks involving DS are, mostly, naturally multi-label, there are few works dealing with this problem. Besides, it is not yet known how well various multi-label approaches work in such a context, and there are not many data sets in the literature [Read et al. 2012]. In this work, we extend MINAS (MultIclass learNing Algorithm for DSs) [de Faria et al. 2016], addressing the DS ML classification task, while dealing with concept drift. In the initial training phase, MINAS builds a decision model based on micro-clusters of labeled examples. In the online phase, new examples are classified using this model, or marked as unknown. Groups of unknown examples can be used later to create new micro-clusters, which are used to update the decision model in order to reflect changes in the known classes. The main contributions of this work are: —A new method to handle ML classification in DS, which adapts to Concept; —The use of only one decision model (composed of different groups) representing the combination of labels in a multi-label problem; —Sets of unlabeled examples, not explained by the current model, are used to represent extensions of known concepts, thus making the decision model dynamic. The remainder of this paper is structured as follows. Section 2 surveys related work. Section 3 describes in details how the LP strategy was used in MINAS. Section 4 presents the experimental results, together with the data sets, literature methods, and parameters used in the experiments. Finally, we summarize and draw conclusions in Section 5, together with future research directions. 2.

RELATED WORK

This section presents the main works related to ML Learning and DS. First we present the traditional methods for multi-label classification and then the ML classification methods in DS. Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

98

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Label Powerset for Multi-label Data Streams Classification with Concept Drift

·

3

The literature has different approaches to solve problems involving ML data. A traditional one that is widely used is Problem Transformation Methods [Tsoumakas et al. 2009]. In this approach, a multi-label classification problem is usually dealt with by transforming the original problem into a set of single-label problems [de Carvalho and Freitas 2009]. This approach is interesting because of its flexibility and general applicability. In fact, all methods in the literature use, mention or compare to at least one of them [Read et al. 2012]. Following we present the main algorithms based on Problem Transformation. The Binary Relevancy (BR) is the most popular Problem Transformation method. It divides the multi-label problem into several simple-label binary problems [Tsoumakas et al. 2009]. However, BR does not directly model correlations which exist between labels thus detracting your performance, and in DS the class-label imbalance may become exacerbated by large numbers of training examples [Read et al. 2012]. Another example of the problem transformation approach is the Label Powerset (LP) method. It transforms the multi-label problem into a simple-label multi-class problem, where each combination of labels present in the data set is transformed into a different and unique label. Its main advantage is to consider the dependence between labels. However, its disadvantages include worst case of computational complexity [Tsoumakas et al. 2009]. [Read et al. 2012] highlights that in a DS context, a challenge for LP is that the label space expands over time due to the emergence of new label combinations. An alternative to the LP method is Pruned Sets (PS) [Read et al. 2008]. In PS, the less frequent label combinations are eliminated in order to reduce the complexity of the problem. In the same work, the Ensemble of Pruned Sets (EPS) algorithm was also proposed. It constructs a number of PS by sampling the training sets (i.e., bootstrap). The authors believe that method can help to avoid the overfitting on the pruning process. The disadvantages of this method is that it can not predict combinations of labels not seen in the training set [Cherman 2013]. [Read et al. 2011] proposed the Classifier Chains (CC) method, where the main idea is to add label dependency to the BR method. CC transforms a multi-label problem into a chain of binary classifiers, in which a classifier is built taking into account the predictions of previous classifiers. An important characteristic of problem transformation is the ability to use any off-the-shelf singlelabel base classifier to suit requirements. With this, using incremental base classifiers, it is possible to treat Data Streams [Read et al. 2012]. The MEKA framework 1 includes these methods, and also ensemble learning methods whose the main idea is to use several models under a voting scheme, trying to achieve superior predictive performance. When used in an ensemble scheme, any incremental method can adapt to concept drift. [Read et al. 2012] use ADWIN Bagging [Bifet et al. 2009] with the online bagging method [Oza and Russell 2001] to detect such changes. Other significant algorithms for Multi-label Classification in Data Stream are presented below. [Xioufis et al. 2011] deal with changes in concept and unbalanced classes using two fixed size windows: one for positive examples and one for negative examples. Besides that, they used BR with the KNN algorithm as the base classifier. [Wang et al. 
2012] proposed an ensemble of classifiers which deals with the multi-label problem in DS with limited labeling resource. [Read et al. 2012] proposed Multi-label Hoeffding Tree. Splitting of each node is based on the calculation of the Multilabel Information Gain,and uses Pruned Set Transformation with Naive Bayes as the classifier base on its leaf. In [Shi et al. 2014], the authors proposed an improvement in the Multi-label Hoeffding Tree. They proposed an extra filtering method in order to improve the performance of the model, by choosing to train it only with the instances that contain the most frequent combinations of labels. Incremental class learning strategies are used by the authors. An extension of this work that is able to detect concept drift is proposed in [Shi et al. 2014]. 1 http://meka.sourceforge.net/

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

99

5th KDMiLe – Proceedings

4

3.

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Costa Junior et. al.

PROPOSAL: MULTI-LABEL MINAS WITH LABEL POWERSET

3.1 Formalization of the problem [Tsoumakas et al. 2009] formalizes the multi-label classification task as follows: L = {yj : 1...l} represents a finite set of labels, D = {(Xi , Yi ), i = 1...n} represents a set of examples, where Xi is a vector of attributes and Yi ⊆ L the set of labels assigned to the ith example. The LP method transforms each Yi in a single and unique label. As an example, if Y1 = {y1 , y3 } and Y2 = {y1 , y2 , y3 }, single-labels representing these subsets can be denoted as y1,3 and y1,2,3 , respectively. After the LP transformation, the transformed set of examples is represented as D = {(X1 , yLP 1 ), (X2 , yLP 2 ), . . . , (Xn , yLP n )}, with LLP = {yLP 1 , yLP 2 , ..., yLP m } the new set of target labels. Here, m is the number of unique subsets of labels present in the D. After training, a decision model is created, representing the known concept. Because DS examples can be multi-label, the goal in classifying multi-label data streams is to classify a new example Xnew in one of the classes from the set LLP . If yLPnew ∈ LLP , we have a classical multi-class classification problem. Otherwise, a consistent micro-cluster of examples must be analyzed to identify if they that can be used to update the decision model. 3.2 Original MINAS algorithm This section introduces MINAS, algorithm for novelty detection in data streams multi-class problems. MINAS divides the learning process into two phases: offline and online. In offline or initial training phase, MINAS builds a decision model based on a labeled data set. In the online phase, new examples are classified using this model, or marked as unknown. Micro-clusters of unknown examples can be later used to create valid novelty patterns, which are added to the current decision model. To build the decision model, MINAS receives a set of labeled data containing examples of one or more classes. The set is separated into different subsets, one for each class, after that, a clustering algorithm (Kmeans ou CluStream) runs on each of these subsets separating them into k-micro-clusters. Therefore, each class of the problem is represented by a set of micro-clusters. The idea of creating a classifier composed of a set of micro-clusters is that these micro-clusters can evolve over time. As new unlabeled data arrives, it can be either sorted or rejected by the actual decision model. The examples classified by the decision model can be used to update it. However, those that were rejected receive an unknown label, and are sent to a short-time-memory for further analysis. From time to time, the unknown examples are used to model new micro-clusters. The obtained micro-clusters are validated in order to discard the non-cohesive or non-representative ones. Valid micro-cluster are analyzed in order to decide whether they represent extensions of the known concepts or novelty patterns. In both cases, these micro-clusters are used to update the decision model. With this MINAS can adapt to concept drift and concept evolution. 3.3

Proposed MINAS multi-label extension

In this section, we present our method for multi-label classification in Data Stream using a adaptation of the MINAS algorithm. Knowing the effectiveness of the problem transformation algorithms, we use the LP method to transform the multi-label dataset in a single-label multi-class dataset and, with this, create micro-clusters and build a decision model that can be continuously updated. Offline Phase: The supervised learning phase (or offline phase or training phase) is performed only once and represents the learning of known concepts. This phase receives as input a labeled data set containing examples from different classes. The number of label combinations (Label Powerset) is calculated, and only combinations with a sufficient number of positive examples are kept in the data Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

100

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

Label Powerset for Multi-label Data Streams Classification with Concept Drift

·

5

set. The remaining examples are selected and transformed into simple-labels using the LP method. This training data is divided into subsets, each one representing one class of the problem (one Label Powerset). The KMeans algorithm is used in each subset to create k-micro-clusters, representing each class. In this work, as in MINAS, each micro-cluster is represented by four-tuple: Number of micro-cluster examples (N ), linear sum of the N micro-cluster examples (LS), squared sum of the N micro-cluster examples (SS), and timestamp of the arrival of the last example in the micro-cluster. With this information, it is possible to calculate measures such as the centroid of a micro-cluster and its radius. Micro-clusters have the properties of incrementality and additivity, making possible the evolution of the decision model over time without the need of its total reconstruction. Besides this information, each micro-cluster receives a label that represents the subset of classes to which this micro-cluster is associated. The initial decision model is composed of the union of the micro-cluster obtained for each class. Using the micro-clusters information, we can also calculate the centroid and the maximum limit of each cluster. With this information, if the distance of a new example to the centroid is smaller than the maximum limit of the micro-cluster, it will belong to that micro-cluster. Online Phase: In this phase, examples that arrive along the stream are labeled using the current decision model. This phase is composed by three operations: classify new examples, detect concept drifts, and update the decision model. For each new example, the distance between the example and the centroid of the closest micro-cluster is calculated. We used Euclidean Distance for numeric attributes and Hamming distance for nominal attributes. If the distance from the nearest one is smaller than a threshold, then the example is added to this micro-clusters and receives the label associated with it, otherwise it is marked as unknown and sent to short-time-memory. To classify an example as unknown means to say that the current decision model does not have enough information to classify it. To detect concept drift, after a while a KMeans algorithm is executed using the unknown examples stored in the short-time memory in order to discover new clusters. Each new cluster is evaluated to verify if it represents a valid cluster. A cluster is only valid if it has a minimum number of examples, which is a value set by the user. If a valid cluster is detected, its examples are labeled with the label of its nearest cluster. These new clusters are used to update the decision model. 4.

EXPERIMENTS AND RESULTS

4.1 Data sets

Table I presents a summary of the main characteristics of the data sets used in the experiments. We chose large multi-label real-world datasets: TMC2007, IMDB and MediaMill, available at sourceforge.net/projects/meka/files/Datasets/ and mulan.sourceforge.net/datasets-mlc.html. We treat these data sets as a stream by processing their examples in the order in which they appear in the data sets. For synthetic data, we use two types of generation schemes, listed below. Both generators are available in the MOA (Massive Online Analysis) framework [Bifet et al. 2010] (https://moa.cms.waikato.ac.nz/).

—Random Tree Generator (RTG): produces concepts that in theory should favour decision tree learners;
—Radial Basis Function (RBF): effectively creates a normally distributed hyper-sphere of examples surrounding each central point, with varying densities.

To simulate concept drift in the synthetic data sets we use the framework proposed in [Read et al. 2012], also distributed with MOA, which allows us to simulate changes in the label cardinality and in the label dependencies of the datasets.
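As an illustration of the kind of drift injected into the synthetic streams, the toy generator below (a plain Python sketch, not the Read et al. 2012 / MOA implementation) increases the expected label cardinality from 1.5 to 3.5 at a chosen drift point, mirroring the Drift-SynG-RBF row of Table I; the names and parameter values are assumptions made only for this example.

from itertools import islice
import numpy as np

rng = np.random.default_rng(42)

def synthetic_stream(n_examples, n_labels=25, drift_at=50_000):
    """Toy multi-label stream with a label-cardinality drift at drift_at."""
    for t in range(n_examples):
        x = rng.normal(size=80)                     # 80 numeric attributes
        cardinality = 1.5 if t < drift_at else 3.5  # expected labels per example
        y = rng.random(n_labels) < cardinality / n_labels
        yield x, y.astype(int)

stream = synthetic_stream(100_000)
head = [y.sum() for _, y in islice(stream, 1_000)]         # before the drift
tail = [y.sum() for _, y in islice(stream, 98_000, None)]  # after the drift
print(np.mean(head), np.mean(tail))                        # roughly 1.5 vs 3.5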

Table I. Characteristics of the data sets

Name              # Examples   # Numeric   # Nominal   # Labels   Cardinality
TMC2007               28596           0         500         22          2.2
IMDB                 120919           0        1001         28          2.0
MediaMill             43907         120           0        101          4.4
SynG-RBF                1E5          80           0         25          2.8
SynT-RTG                1E6           0          30          8          1.6
Drift-SynG-RBF          1E5          80           0         25    1.5 → 3.5

In the first drift, 10% of the label dependencies change. In the second, more labels are associated on average with each example, i.e., the label cardinality is increased. In the third drift, 20% of the label dependencies change. Such types and magnitudes of drift can be found in real-world data [Read et al. 2012]. We processed the data sets in windows. The concept drifts take place at positions N/1000, N/100 and N/10 in the data sets, where N is the number of examples.

4.2 Literature Methods

We compared our proposal with several incremental problem transformation algorithms from the literature, listed below.

—Binary Relevance-based methods: these methods are composed of binary classifiers, one for each label. It is straightforward to apply BR to a data stream context. The BRUpdatable and CCUpdatable [Read et al. 2012] methods were used;
—Label Powerset-based methods: basic LP must discard any new training example whose labelset combination is not one of the class labels. The PS method from [Read et al. 2008] is an LP method much better suited to this incremental learning context [Read et al. 2012];
—Ensemble Methods: we consider ADWIN Bagging [Bifet et al. 2009], which is the online bagging method of [Oza and Russell 2001] with the addition of the ADWIN algorithm as a change detector. We used m = 10 BR classifiers in the ensemble (EaBR).

All the classification algorithms are implemented within the MEKA framework. Our proposed method was implemented using both the MOA and MEKA frameworks. For all algorithms, their default parameter values were used.

4.3 Evaluation Measures

Unlike single-label classification, wherein an instance is classified either correctly or wrongly, in multi-label classification a classification can be considered partially correct or partially wrong, requiring specific evaluation measures. Let H be a ML classifier, with Z_i = H(X_i) the set of labels assigned by H to a given example X_i, Y_i the true label set, L the total set of labels, and D the set of test examples. Precision (Pr) (Equation 1) and Recall (Re) (Equation 2) are metrics that can be used for evaluation. These measures alone are not adequate to evaluate a classifier, thus we used the F-measure (Equation 3), the harmonic mean of Pr and Re. Another measure that can be used is the Hamming loss (Equation 4), which represents the symmetric difference (∆) between two sets of labels. The lower the Hamming loss value, the better the model.

Pr(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Z_i|}    (1)

Re(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \cap Z_i|}{|Y_i|}    (2)

F\text{-}measure(H, D) = 2 \cdot \frac{Pr \cdot Re}{Pr + Re}    (3)

H\text{-}Loss(H, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \, \Delta \, Z_i|}{|L|}    (4)

Table II. F-measure Results

DataSets          MINAS-LP      PS      BR      CC    EaBR     EPS
TMC2007              0.355   0.565   0.547   0.548   0.560   0.565
IMDB                 0.241   0.169   0.019   0.018   0.169   0
MediaMill            0.435   0.421   0.434   0.434   0.093   0.422
SynG-RBF             0.179   0.185   0.012   0.001   0       0.185
SynT-RTG             0.496   0.490   0.373   0.513   0.331   0.495
Drift-SynG-RBF       0.152   0.152   0.047   0.110   0.048   0.152
Average              0.310   0.330   0.239   0.271   0.200   0.303

Table III. Hamming Loss Results

DataSets          MINAS-LP      PS      BR      CC    EaBR     EPS
TMC2007              0.093   0.074   0.080   0.080   0.079   0.075
IMDB                 0.083   0.090   0.075   0.075   0.090   0
MediaMill            0.043   0.048   0.036   0.036   0.675   0.048
SynG-RBF             0.182   0.185   0.109   0.107   0.107   0.185
SynT-RTG             0.345   0.349   0.298   0.333   0.292   0.346
Drift-SynG-RBF       0.213   0.213   0.180   0.197   0.182   0.213
Average              0.160   0.160   0.130   0.138   0.238   0.145
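For concreteness, the sketch below computes the example-based measures of Equations (1)-(4) for dense binary label matrices; it is a minimal illustration (the handling of empty label sets is our assumption), not the evaluation code used in the experiments.

import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Example-based Pr, Re, F-measure and Hamming loss (Equations 1-4).

    Y_true, Y_pred: binary arrays of shape (|D|, |L|). Empty predicted or
    true label sets contribute 0 to the corresponding sum, which is one
    common convention (an assumption, not stated in the paper).
    """
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    inter = (Y_true & Y_pred).sum(axis=1)
    sym_diff = (Y_true ^ Y_pred).sum(axis=1)

    pr = np.mean(np.divide(inter, Y_pred.sum(axis=1),
                           out=np.zeros_like(inter, dtype=float),
                           where=Y_pred.sum(axis=1) > 0))
    re = np.mean(np.divide(inter, Y_true.sum(axis=1),
                           out=np.zeros_like(inter, dtype=float),
                           where=Y_true.sum(axis=1) > 0))
    f1 = 0.0 if pr + re == 0 else 2 * pr * re / (pr + re)
    hloss = np.mean(sym_diff / Y_true.shape[1])
    return pr, re, f1, hloss

# Toy example with |D| = 2 examples and |L| = 4 labels
Y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]
Y_pred = [[1, 1, 0, 0], [0, 1, 1, 1]]
print(multilabel_metrics(Y_true, Y_pred))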

We used the prequential methodology, which evaluates a classifier on a stream by testing and then training with each example in sequence. It may use a sliding window or a fading-factor forgetting mechanism [Gama et al. 2009]. For the tests we used a window w = N/20 (i.e., the full stream is divided into 20 windows) and calculated the average of the measures over the data windows. This experimental design is similar to the one used by [Read et al. 2012].

4.4 Experimental Results

Table II shows the F-measure results obtained in our experiments. We refer to our method as MINAS-LP. As can be seen, considering both the average and the individual data set results, the performance of MINAS-LP can be considered competitive with the performances of the other literature methods. Our method could not obtain a better F-measure average than the PS algorithm, nor a better Hamming loss average than the BR algorithm (Table III). However, the results obtained can be considered promising, especially considering that there is still much that can be explored to improve the algorithm. One modification that can improve the results is the inclusion of a mechanism to clean the short-time memory, eliminating the examples marked as unknown that were not used in the creation of new micro-clusters. This can improve concept drift detection and also eliminate noise and outliers. Another improvement that can be implemented is related to concept evolution detection. For this, we can investigate new strategies to transform the multi-label data sets, such as Binary Relevance-based methods.

5. CONCLUSIONS AND FUTURE WORKS

In this paper we presented an extension of the MultIclass learNing Algorithm for novelty detection in data Streams (MINAS) for multi-label classification in data streams, which uses the problem transformation


method Label Powerset. In the Offline phase we separate the examples that have frequent combinations of labels and transform each combination into a single-label class that represents that subset of labels. Each of these classes is represented by a set of clusters that are later used to classify new examples. In the Online phase, examples not explained by the current model are classified as unknown. A representative set of unknown examples is used to discover clusters that represent extensions to the known concepts. These extensions are used to update the model and treat concept drifts. The experiments showed that our method achieved promising results when compared with the literature methods investigated. Using an artificial data set, our method was able to adapt to the concept drifts, achieving good results. Our method performed efficiently, especially considering that there is still much that we can explore to improve the algorithm. This work opens up several perspectives for future work: we intend to investigate new strategies to transform the multi-label data sets, such as the Binary Relevance or Pairwise methods; more multi-label classification algorithms for data streams should be used in the experimental comparisons; non-spherical clustering techniques should be investigated to better represent the classes; the evaluation methodology should be examined more closely; and, mainly, concept evolution should be treated.

REFERENCES

Aggarwal, C. C. Data streams: models and algorithms. Vol. 31. Springer Science & Business Media, 2007.
Bifet, A., Holmes, G., Kirkby, R., and Pfahringer, B. MOA: Massive online analysis. Journal of Machine Learning Research 11 (May): 1601–1604, 2010.
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., and Gavaldà, R. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 139–148, 2009.
Cherman, E. A. Aprendizado de máquina multirrótulo: explorando a dependência de rótulos e o aprendizado ativo. Ph.D. thesis, Universidade de São Paulo, 2013.
de Carvalho, A. C. and Freitas, A. A. A tutorial on multi-label classification techniques. In Foundations of Computational Intelligence Volume 5. Springer, pp. 177–195, 2009.
de Faria, E. R., de Leon Ferreira, A. C. P., Gama, J., et al. MINAS: multiclass learning algorithm for novelty detection in data streams. Data Mining and Knowledge Discovery 30 (3): 640–680, 2016.
Fan, J. and Li, D. An overview of data mining and knowledge discovery. Journal of Computer Science and Technology 13 (4): 348–368, 1998.
Gama, J. Knowledge discovery from data streams. CRC Press, 2010.
Gama, J., Sebastião, R., and Rodrigues, P. P. Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 329–338, 2009.
Nguyen, H.-L., Woon, Y.-K., and Ng, W.-K. A survey on data stream clustering and classification. Knowledge and Information Systems 45 (3): 535–569, 2015.
Oza, N. C. and Russell, S. Online ensemble learning. University of California, Berkeley, 2001.
Read, J., Bifet, A., Holmes, G., and Pfahringer, B. Scalable and efficient multi-label classification for evolving data streams. Machine Learning 88 (1-2): 243–272, 2012.
Read, J., Pfahringer, B., and Holmes, G. Multi-label classification using ensembles of pruned sets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, pp. 995–1000, 2008.
Read, J., Pfahringer, B., Holmes, G., and Frank, E. Classifier chains for multi-label classification. Machine Learning 85 (3): 333–359, 2011.
Shi, Z., Wen, Y., Feng, C., and Zhao, H. Drift detection for multi-label data streams based on label grouping and entropy. In Data Mining Workshop (ICDMW), 2014 IEEE International Conference on. IEEE, pp. 724–731, 2014.
Shi, Z., Wen, Y., Xue, Y., and Cai, G. Efficient class incremental learning for multi-label classification of evolving data streams. In Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, pp. 2093–2099, 2014.
Tsoumakas, G., Katakis, I., and Vlahavas, I. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer, pp. 667–685, 2009.
Wang, P., Zhang, P., and Guo, L. Mining multi-label data streams using ensemble-based active learning. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, pp. 1131–1140, 2012.
Xioufis, E. S., Spiliopoulou, M., Tsoumakas, G., and Vlahavas, I. Dealing with concept drift and class imbalance in multi-label stream classification, 2011.


Using Scene Context to Improve Object Recognition

Leandro P. da Silva1, Roger Granada1, Juarez Monteiro1, Duncan D. Ruiz2

1 Faculdade de Informática, Pontifícia Universidade Católica do Rio Grande do Sul
Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil
Email: {leandro.silva.007, roger.granada, juarez.santos}@acad.pucrs.br
2 Email: [email protected]

Abstract. Computer vision is the science that aims to give computers the capability of seeing the world around them. Among its tasks, object recognition intends to classify objects and to identify where each object is in a given image. As objects tend to occur in particular environments, their contextual association can be useful to improve the object recognition task. To address contextual awareness in the object recognition task, our approach uses the context of the scenes in order to achieve higher quality in object recognition, by fusing context information with object detection features. Hence, we propose a novel architecture composed of two convolutional neural networks based on two well-known pre-trained nets: Places365 and Faster R-CNN. Our two-stream architecture uses the concatenation of object features with scene context features in a late fusion approach. We perform experiments using the PASCAL VOC 2007 and MS COCO public datasets, analyzing the performance at different values of intersection over union. Results show that our approach is able to raise the scores of in-context objects and to reduce the scores of out-of-context objects.

Categories and Subject Descriptors: I.2.10 [Artificial Intelligence]: Vision and Scene Understanding; I.2.6 [Artificial Intelligence]: Learning

Keywords: convolutional neural networks, neural networks, object recognition

1. INTRODUCTION

Human brains were designed to understand the visual world with ease, receiving information about objects, contexts and their associations since we are born. Observing a scene context, a human can easily recognize objects that belong to that context, while requiring a longer time to identify out-of-context objects. As described by Biederman et al. [1982], contextual information, as well as the relative size between objects and their location, are important cues used by humans to detect objects. In fact, even when using low-resolution images humans can distinguish whether an object on a table is a plate or an airplane, since it is unlikely to have an airplane on a table. As pointed out by Oliva and Torralba [2007], in the real world, objects co-occur with other objects and in particular environments, providing a rich source of contextual associations to be exploited by a visual system. These relations between an object and its surroundings are classified by Biederman et al. [1982] into five different classes: interposition, support, probability, position and familiar size. While interposition and support refer to physical space, probability, position and familiar size are defined as semantic relations, since they require access to the referential meaning of the object being considered. Semantic relations include information about detailed interactions among objects in the scene and they are often used as contextual features [Galleguillos and Belongie 2010]. Although much work has been proposed to improve computer vision algorithms, they are still


far from the human ability to recognize objects despite their pose, illumination and occlusions. In this work, we address the problem of object recognition by using contextual features (features of the scene context) as an indication of the presence of the object. Thus, our main idea is that the identified context will help the object recognizer to improve the classification of objects that are context dependent. Unlike previous work [Liu et al. 2016], our approach deals with raw images, without the need to occlude the object in order to detect the context. We propose an approach that relies on a deep neural architecture comprising two well-known pre-trained Convolutional Neural Networks (CNNs): Places365 [Zhou et al. 2017] and Faster R-CNN [Ren et al. 2015]. They are fused in order to improve results in the object recognition task. We perform experiments using PASCAL VOC 2007 [Everingham et al. 2010] and MS COCO [Lin et al. 2014], and examine the influence on the final classification when adding the context to the objects. We found that it is possible to improve the detection of objects by properly considering the context. The rest of this paper is structured as follows. In Section 2, we describe the architecture we use to recognize objects using scene contexts. Section 3 describes the datasets we use in this work, as well as our experimental settings. In Section 4, we report the corresponding results and present a discussion about them. Section 5 reports related work that also performs object recognition using contexts. Finally, the paper ends with our conclusions and future work directions in Section 6.

2. ARCHITECTURE DESIGN

Early studies have shown the importance of the context to give cues about the recognition of certain objects. As described by Galleguillos and Belongie [2010], context can be exploited in computer vision in three main forms: (1) the semantic context, described by the likelihood of an object being present in the scene and not in other scenes, e.g., a computer and a keyboard are more likely to appear in the same image than a computer and a car; (2) the likelihood of finding an object in a certain position relative to other objects in the scene, e.g., a keyboard is expected to be below the monitor; and (3) the size of the object in the context in relation to the other objects in the scene, e.g., a keyboard is expected to be smaller than a person in the scene. In our work, we mainly exploit the semantic context, as the likelihood of an object being in a scene and not in other scenes. It is important to note that not all objects have a strong relationship with the context, e.g., a person may appear in many contexts, such as in a house or on the street, while a fire hydrant tends to always appear on the sidewalk. In our work, the likelihood of an object being detected in a scene will tend to increase when the object has a strong relationship with the context and to decrease when the object is out of the contexts previously seen. In order to do that, we propose an approach that fuses two pre-trained deep architectures that extract information about the objects and information about the context separately. Hence, the likelihood of a detected object may be changed by the context in which the object is inserted. Figure 1 illustrates our approach, which receives the raw image as input and pre-processes this image to fit the input of each deep architecture (Faster R-CNN and Places365). Faster R-CNN extracts features of objects while Places365 extracts features of the scene from a pre-processed version of the original image. Our late fusion approach is similar to previous work [Karpathy et al. 2014; Simonyan and Zisserman 2014; Li et al. 2017], where features from the two streams of independent CNNs are merged in the last fully connected layer. We briefly explain each of these deep architectures.

2.1 Pre-Processing Images

The pre-processing step intends to adapt the raw input images to fit each deep architecture (Faster R-CNN and Places365). While Faster R-CNN works with high-resolution images of different sizes, Places365 works with a low resolution and a fixed input size. In order to meet the Places365 requirements, we first transform the input image into a square by cropping the largest dimension and maintaining


Fig. 1. Architecture of our approach, which consists of the concatenation of Faster R-CNN (object detection) and Places365 (place recognition).

the lowest dimension, e.g., an input image of 460×640 is cropped to 460×460. The cropped image is then resized to 224×224, which corresponds to the input size of Places365.

2.2 Faster R-CNN

Faster R-CNN [Ren et al. 2015] is a deep architecture to detect objects that contains two modules: a Region Proposal Network (RPN) and a Fast R-CNN detector [Girshick 2015]. The RPN module contains a deep fully convolutional network that receives an image as input and outputs a set of region proposals (i.e., object bounds), each with an objectness score for object detection. The region proposals are obtained by sliding over the last shared convolutional feature map to determine whether each region is an object or not. Then, the Fast R-CNN detector uses the proposed regions for detection. Ren et al. [2015] made the deep architecture available in two flavours, using ZFNet [Zeiler and Fergus 2014] and VGG-16 [Simonyan and Zisserman 2015]. In our work, we use VGG-16 as the convolutional neural network.

2.3 Places365

Places365 [Zhou et al. 2017] is a CNN model pre-trained using the 1,803,460 images of the Places365-Standard dataset. Ground-truth labels were verified by crowdsourcing the task with Amazon Mechanical Turk (AMT), and the dataset contains a list of the categories of environments encountered in the world, such as bedroom, train station platform, conference center, veterinarians' office, etc. Zhou et al. [2017] made the pre-trained models available using three popular CNN architectures: AlexNet [Krizhevsky et al. 2012], GoogLeNet [Szegedy et al. 2015], and VGG-16 [Simonyan and Zisserman 2015]. In our work, we use the VGG-16 version of the pre-trained network, since we intend to keep the output of the network with the same dimension as the output of Faster R-CNN, favouring the concatenation of both networks.

2.4 Fusion network

Our fusion method intends to merge the information of the scene, i.e., the context of an object, with the information of each object in order to improve its classification. The fusion is inspired by the approach proposed by Li et al. [2017], which concatenates the two fully connected layers prior to classification. For model fitting, we freeze all layers of Places365 and all but the last fully connected layers of the Faster R-CNN. Such a configuration allows the architecture to learn objects while being affected by the scene context. It is important to note that, as Faster R-CNN outputs 300 regions of interest

107

5th KDMiLe – Proceedings

4

·

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

L.P. da Silva, R.L. Granada, J. Monteiro, D.D.A. Ruiz

(RoI) for each input image, and Places365 extracts features for a single image, before the concatenation layer we have to replicate the features from Places365 up to the number of RoIs from Faster R-CNN. Figure 1 illustrates the replication layer with 300 dimensions (layer "rep") and the concatenation layer prior to the object classification (layer "concat"). As explained by Li et al. [2017], the information of the scene does not contribute to predicting the coordinates of bounding boxes. Thus, the output for predicting the coordinates of bounding boxes is separate from the output for predicting the context.
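The "rep" and "concat" steps can be summarized with the small NumPy sketch below; the array shapes follow Figure 1, but the variable names and the random placeholder features are assumptions made only for illustration (this is not the Caffe model definition used in the paper).

import numpy as np

num_rois, feat_dim = 300, 4096

roi_features = np.random.rand(num_rois, feat_dim)    # Faster R-CNN fc7 output (300 x 4096)
scene_features = np.random.rand(1, feat_dim)         # Places365 fc7 output (1 x 4096)

rep = np.repeat(scene_features, num_rois, axis=0)     # "rep" layer: 300 x 4096
concat = np.concatenate([roi_features, rep], axis=1)  # "concat" layer: 300 x 8192

# The concatenated features feed the object classification softmax,
# while bounding-box regression uses only the RoI features.
assert concat.shape == (num_rois, 2 * feat_dim)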

3. EXPERIMENTS

In this section, we describe the datasets used in our experiments and the implementation details used in our approach.

3.1 Datasets

PASCAL VOC 2007 (http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html) is a dataset that consists of images representing realistic scenes, where each image has at least one object. This dataset was published through the PASCAL Visual Object Classes Challenge 2007 and contains 20 different object classes, having a total of 9,963 images with 24,640 annotated objects. Classes are grouped as Person (person), Animal (bird, cat, cow, dog, horse, sheep), Vehicles (aeroplane, bicycle, boat, bus, car, motorbike, train) and Indoor (bottle, chair, dining table, potted plant, sofa, tv/monitor). The dataset is split into 2,501 images for training, 2,510 images for validation, and 4,952 images for testing models. It is important to mention that, as the main intention of PASCAL VOC is object recognition, most images contain little information about context, with objects filling almost the whole image.

MS COCO (http://mscoco.org/dataset/) is a dataset that contains images of complex everyday scenes with common objects in their natural context. This dataset addresses core research problems in scene understanding, such as detecting non-iconic views, i.e., objects in the background, partially occluded and amid clutter. Objects' spatial locations are annotated using bounding boxes and pixel-level segmentation. The dataset contains 80 object classes and a total of 123,287 images, which are divided into 82,783 for training and 40,504 for validating and testing models. Compared to PASCAL VOC 2007 [Everingham et al. 2010], the MS COCO dataset contains considerably more objects per scene and smaller labeled objects.

3.2 Network settings

In order to perform experiments we developed our architecture containing two VGG-16 networks [Simonyan and Zisserman 2015] using the Caffe framework (http://caffe.berkeleyvision.org/). Our architecture allows loading two pre-trained models that run in parallel, plus a fully connected layer that performs the fusion of both networks. As we do not train a model for scene recognition, a pre-trained model of Places365 [Zhou et al. 2017] is mandatory for extracting features from scenes. The VGG-16 pre-trained model is freely available on the MIT CSAIL Computer Vision website (https://github.com/CSAILVision/places365). We perform two experiments: loading a model pre-trained on the PASCAL VOC 2007 [Everingham et al. 2010] dataset and training from scratch a model using the MS COCO [Lin et al. 2014] dataset. The pre-trained version of the VGG-16 network contains weights learned using the PASCAL VOC 2007 dataset and was downloaded from the Faster R-CNN website (https://github.com/rbgirshick/py-faster-rcnn).


The network trained from scratch on the MS COCO dataset uses the same parameters as described by Ren et al. [2015]: weights are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. All images have their pixels subtracted by the mean pixel values per channel of all training images. We use a learning rate of 10^-3, dropping it to 10^-4 after 240k iterations, and a momentum of 0.9 with a weight decay of 5×10^-4. All convolutions use rectified linear activation units (ReLU). To minimize the chances of overfitting, we apply dropout on the fully connected layers with a probability of 50%. We fix the number of RoIs generated by Faster R-CNN to 300, since this value achieved the best results in [Ren et al. 2015]. Each iteration forwards a single image and the network stops after 490k iterations.

3.3 Fusion settings

As the output of Faster R-CNN generates 300 RoIs for each image, we have to replicate the features generated by the Places365 network up to 300. The last fully connected layer of the network is trained for 240k iterations when using the MS COCO dataset and for 70k iterations when using the version pre-trained on the PASCAL VOC dataset. The fully connected layer is trained using a learning rate of 10^-3, reducing it to 10^-4 after 75% of the training in both datasets, and a dropout of 50% for both networks.

3.4 Evaluation

In order to evaluate our approach, we compare the results using the testing sets of the MS COCO [Lin et al. 2014] and PASCAL VOC 2007 [Everingham et al. 2010] datasets. We report the mean Average Precision (mAP) using Intersection over Union (IoU) scores, i.e., a predicted bounding box is considered correct if the ratio between the area of overlap of the predicted and ground-truth bounding boxes and the area encompassed by both bounding boxes is greater than a threshold. We vary the IoU threshold from 0 to 1, increasing it by 0.1 at each step for each tested network. To compare the addition of context features with object features, we also test our Faster R-CNN trained from scratch with parameters similar to those of the original work that introduced Faster R-CNN [Ren et al. 2015].
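The IoU criterion used above can be written as a small helper; the function below is an illustrative sketch with a hypothetical box format (x1, y1, x2, y2) and is not the official PASCAL VOC / MS COCO evaluation code.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A predicted box is counted as correct when iou(pred, gt) is greater than
# the threshold, with the threshold varied from 0 to 1 in steps of 0.1.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # about 0.143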

4. RESULTS AND DISCUSSION

The mAP scores achieved by each approach, with the IoU threshold varying from 0 to 1, are presented in Figure 2 for each tested dataset. As we can observe, when compared with the Faster R-CNN trained from scratch without using the scene context, the mAP for most of the thresholds is lower when using the context, an expected result.

Fig. 2. Mean Average Precision (mAP) for the PASCAL VOC 2007 and MS COCO datasets with different values of Intersection over Union (IoU), comparing the Fusion approach with Faster R-CNN.


Fig. 3. Average precision (AP) per class using an Intersection over Union (IoU) fixed at 0.5 in the PASCAL VOC 2007 dataset, comparing the Fusion approach with Faster R-CNN.

A comparison of the average precision achieved for each class using an IoU fixed at 0.5 on the PASCAL VOC 2007 dataset is presented in Figure 3. Due to space constraints, we decided to analyze the average precision using only PASCAL VOC 2007, since MS COCO contains 80 classes. In Figure 3, we can observe that the precision values are very close for both methods. In almost half of the classes, the fusion method overcomes the result achieved by Faster R-CNN without considering the scene context. Although in some cases it seems worse to use the context to identify objects, our intention in this paper is to increase the confidence in objects that are context related and to decrease the confidence when objects are not related to the context. For instance, suppose a classifier that does not use the context predicts an armchair in the image with a confidence of 50%. When using the scene context, our intention is to increase this confidence if the armchair is located in a living room, or to decrease it if the armchair is located in a scene where it should not appear frequently, such as a forest. In order to perform such an analysis, we select some images from the dataset and analyze the probability scores of each object in the scene when classifying with and without context. As illustrated in Figure 4 (a), the image of a living room classified by Faster R-CNN without context contains objects identified as a bird (63.3%), an oven (64.5%), a chair (on the left, 84.7%) and a chair (on the right, 69.6%). When the same image is classified using the features of the context (Figure 4 (b)), the bounding boxes of the objects classified as bird, oven and chair (on the left) are eliminated, since their likelihood decreased to less than 50%; the chair on the right has its likelihood decreased to 51.1% and a couch is identified in the same position with 55.6%. It is interesting to observe that some objects not directly related to the living room have their likelihood decreased, while objects that are related to the scene have their likelihood increased, e.g., the couch on the right was initially identified by Faster R-CNN without using the context with a likelihood inferior to 50%. On the other hand, we analyze images where the object is out of its context in order to observe the effect of the context on the likelihood of the object, e.g., the image from the SUN 09 dataset [Jin

Fig. 4. Example of object recognition using Faster R-CNN ((a) and (c)) and our architecture ((b) and (d)).


et al. 2010] presented in Figure 4 (c) illustrates a bed in the middle of the forest, which is not a usual context for a bed. Forwarding this image through Faster R-CNN without using the context, the bed is classified with a likelihood of 95.4% and an elephant with a likelihood of 70.6% (the elephant is incorrectly classified by the network). When the same image is forwarded through our network using the context (Figure 4 (d)), the bounding box indicating an elephant is classified with a likelihood under 50%, not appearing in the image, while the bed has its likelihood decreased to 78.9%, indicating that the features of the context influenced their classification.

5. RELATED WORK

Liu et al. [2016] propose an approach to address a fine-grained classification problem using a two-stream contextualized CNN. Their approach uses a network, named content net, to capture object features and a network, named context net, to capture background features. While the content net is fed with images extracted from the bounding boxes of objects, the context net is fed with the whole image, having the bounding box region filled with pixels from the equivalent position of the mean image calculated across the training set. The two-stream networks are merged in a fusion layer that automatically learns weights from content and context features and outputs the final classification. Bell et al. [2016] explore a contextual and multi-scale idea, presenting a new approach called Inside-Outside Net (ION), which is able to detect objects by exploiting information from inside and outside the regions of interest. Bell et al. use object proposal detectors and dynamic pooling to evaluate the different RoI candidates in an image. A spatial recurrent neural network computes features from the contextual information outside the region of interest. Object and contextual information are concatenated and passed through several fully connected layers for classification. The experiments of Bell et al. [2016] achieve 77.9% mAP on PASCAL VOC 2012 [Everingham et al. 2010] and 33.1% mAP on MS COCO [Lin et al. 2014]. Li et al. [2017] developed a novel attention-to-context CNN (AC-CNN) based object detection model. Their approach incorporates global and local contextual information into a CNN using a global contextualized (AGC) subnetwork and a multi-scale local contextualized (MLC) subnetwork. AGC is responsible for highlighting useful global contextual locations through multiple stacked long short-term memory (LSTM) [Hochreiter and Schmidhuber 1997] layers, while MLC captures both inside and outside contextual information, capturing the surrounding local context. Both global and local context are fused by fully connected layers prior to classification. The experiments of Li et al. [2017] are performed using PASCAL VOC 2007, achieving 72.4% mAP, and PASCAL VOC 2012, achieving 70.6% mAP.

6. CONCLUSIONS AND FUTURE WORK

In this work, we developed a novel architecture for object recognition based on two convolutional neural networks (Faster R-CNN [Ren et al. 2015] and Places365 [Zhou et al. 2017]). The pipeline of the architecture includes a CNN focused on recognizing objects and another CNN recognizing the context of the scene. We concatenate the object features with the context features to predict the class of each object. We perform experiments using two different datasets: MS COCO [Lin et al. 2014] and PASCAL VOC 2007 [Everingham et al. 2010]. The results of our experiments show that although our object recognition approach does not improve the overall performance when compared to the state-of-the-art recognizer, it performs very well when the object is situated in its proper context. Preliminary analysis of our architecture demonstrates that the likelihood of an object tends to increase when the scene context is related to the object, and to decrease when the object is out of its context. As future work we intend to create subsets of the MS COCO and PASCAL VOC 2007 datasets, focusing on discarding images with objects that are not dependent on the context. These new datasets


may indicate whether our approach improves the accuracy for objects that are context-dependent.

ACKNOWLEDGMENT

This paper was achieved in cooperation with Hewlett Packard Brasil LTDA. using incentives of the Brazilian Informatics Law (Law no 8.248 of 1991).

REFERENCES

Bell, S., Lawrence Zitnick, C., Bala, K., and Girshick, R. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR'16. IEEE Computer Society, Washington, DC, USA, pp. 2874–2883, 2016.
Biederman, I., Mezzanotte, R. J., and Rabinowitz, J. C. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology 14 (2): 143–177, April, 1982.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2): 303–338, June, 2010.
Galleguillos, C. and Belongie, S. Context based object categorization: A critical survey. Computer Vision and Image Understanding 114 (6): 712–722, June, 2010.
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision. ICCV'15. IEEE Computer Society, Washington, DC, USA, pp. 1440–1448, 2015.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation 9 (8): 1735–1780, November, 1997.
Jin, C. M., Lim, J. J., Torralba, A., and Willsky, A. S. Exploiting hierarchical context on a large database of object categories. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. CVPR'10. IEEE Computer Society, Washington, DC, USA, pp. 129–136, 2010.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. CVPR'14. IEEE Computer Society, Washington, DC, USA, pp. 1725–1732, 2014.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS'12. Curran Associates Inc., USA, pp. 1097–1105, 2012.
Li, J., Wei, Y., Liang, X., Dong, J., Xu, T., Feng, J., and Yan, S. Attentive contexts for object detection. IEEE Transactions on Multimedia 19 (5): 944–954, May, 2017.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision. ECCV 2014. Springer International Publishing, Zurich, Switzerland, pp. 740–755, 2014.
Liu, J., Gao, C., Meng, D., and Zuo, W. Two-stream contextualized CNN for fine-grained image classification. In Proceedings of the 13th AAAI Conference on Artificial Intelligence. AAAI'16. AAAI Press, Phoenix, Arizona, USA, pp. 4232–4233, 2016.
Oliva, A. and Torralba, A. The role of context in object recognition. Trends in Cognitive Sciences 11 (12): 520–527, December, 2007.
Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems. NIPS'15. MIT Press, Cambridge, MA, USA, pp. 91–99, 2015.
Simonyan, K. and Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems. NIPS'14. MIT Press, Cambridge, MA, USA, pp. 568–576, 2014.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations. ICLR'15. San Diego, USA, 2015.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. CVPR'15. IEEE Computer Society, Washington, DC, USA, pp. 1–9, 2015.
Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the 13th European Conference on Computer Vision. ECCV 2014. Springer International Publishing, Zurich, Switzerland, pp. 818–833, 2014.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99): 1–14, July, 2017.


An Empirical Comparison of Hierarchical and Ranking-Based Feature Selection Techniques in Bioinformatics Datasets

Luan Rios Campos1, Matheus Giovanni Pires1

1 Universidade Estadual de Feira de Santana
[email protected]
[email protected]

Abstract. Feature selection stands as an important procedure that may increase the accuracy of classification algorithms, since it is possible to reduce the amount of irrelevant or redundant information in datasets. This article discusses the use of six different ranking-based feature selection methods, along with a method for analysing redundancy in the Gene Ontology (GO) term features of four model organisms' datasets. Three further ranked lists are also proposed, built from the combination of the other six ranked lists by considering their mean, median and weighted mean rank positions. Two classification algorithms, K-Nearest Neighbour (KNN) and Naive Bayes, are used to evaluate how relevant the new datasets are when compared to the original ones, and statistical analyses, using the Friedman test and Holm's post-hoc test, were applied to determine whether or not there are differences amongst the methods used.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.2.6 [Artificial Intelligence]: Learning

Keywords: feature selection, gene ontology, ranking-based feature selection, classification

1. INTRODUCTION

Classification is an important task encountered in various fields such as pattern recognition, decision making, data mining and modeling. The classification task can be roughly described as defining a class, among a set of classes, for each instance of a dataset. This dataset represents a particular domain, where each instance is composed of features that describe the data. Larger datasets, which are present in most real problems, directly affect the accuracy and computational cost of a classification algorithm. In this context, feature selection (FS) stands as an important procedure that tries to increase the accuracy of classification algorithms by reducing the amount of irrelevant or redundant information (features) in datasets [Liu and Motoda 1998]. We can cite two categories into which FS methods can be classified: (1) wrapper methods, where the selection criterion is dependent on the learning algorithm, being a part of the fitness function [Kohavi and John 1997]; (2) filtering methods, where the selection criterion is independent of the learning algorithm (separability measures are employed to guide the selection) [Guyon and Elisseeff 2003]. One way to apply a filter approach to select a subset of features is by using ranking-based methods, in which each feature of a dataset is sorted in a ranked list according to the score calculated by the method's algorithm. The selection of the most relevant features, then, can be made by applying a

ACKNOWLEDGMENT
We thank Alex Freitas, University of Kent, UK, for valuable comments in the early phase of this research.


threshold, where all features below this threshold are discarded. This work has two goals. Firstly, to reduce the large number of features of four datasets, which represent model organisms, namely: Drosophila melanogaster, Mus musculus, Caenorhabditis elegans and Saccharomyces cerevisiae. Secondly, to classify the genes from these four datasets. Each gene has to be classified into one of two classes, pro-longevity or anti-longevity, based on the values of its features. A gene represents an instance of the dataset and it can be annotated with a number of Gene Ontology (GO) terms, in which each term refers to a type of biological process. Pro-longevity genes are those whose decreased expression (due to knockout, mutations or RNA interference) reduces lifespan and/or whose over-expression extends lifespan. Anti-longevity genes are those whose decreased expression extends lifespan and/or whose over-expression decreases it [Tacutu et al. 2012]. The Gene Ontology database [Ashburner et al. 2000] [Consortium et al. 2004] is a reliable source which provides information about different types of genes, such as the biological processes they perform or their molecular functions, identifying them by unique terms (GO terms). In this work, nine ranking-based methods are used to sort the features in decreasing order (from the most to the least important) considering the dataset's classes. After running the ranking methods over the datasets, a redundancy analysing method is executed to check for redundant information between the features (GO terms). Redundant information can exist in these datasets due to the hierarchical nature of GO terms: when a term is a descendant of another, it carries the same genetic information as its ancestor plus the information from itself. This check is made because the ranking methods do not guarantee that redundancy is reduced in the dataset, since the ranking is based on the correlation between each individual feature and the class variable, ignoring interactions among features. The redundancy analysing method works with the 1-NN (nearest neighbour) and Naïve Bayes classifiers. This article is organised as follows: Section 2 provides an overview of the filter and wrapper approaches to feature selection. Section 3 describes the six ranking methods used in our experiments, pointing out their advantages and disadvantages. Section 4 presents how the proposed redundancy analysing method works and how its results are reported. Section 5 shows the results obtained with feature selection using the partial and average selection approaches for each dataset. Finally, Section 6 summarises this work's contribution, showing what was concluded from the results and suggesting future research directions.

2. FEATURE SELECTION

In the data mining context, many applications have a large number of features in the dataset to be analysed. Sometimes, a certain number of these features can be redundant or irrelevant and, therefore, mislead the classification algorithm. As [Liu and Motoda 1998] mention, some types of classification algorithms can be used to select features, but, if there are too many irrelevant or redundant features, the classification algorithm, by itself, struggles to overcome this problem. In order to compensate for this poor scalability of the classification algorithm, extra feature selection methods are required [Liu and Motoda 1998]. Feature selection methods are specialised in reducing the feature space by selecting a subset of the original one, which, consequently, simplifies the feature space and removes useless features [Liu and Motoda 1998]. As previously introduced, there are two forms of feature selection: the filter and wrapper approaches. Whilst the wrapper approach relies on a classification algorithm to select features, the filter technique analyses whether a feature is relevant or not according to the dataset alone (independently of any classification algorithm) [Liu and Motoda 1998] [Saeys et al. 2007]. The wrapper approach is a feature selection technique that generates a subset of features from a dataset along with the use of a classification algorithm. [Liu and Motoda 1998] point out that the classification algorithm is used as a black box (i.e., there is no need to know the implementation of the algorithm, just its interface) and it functions as an evaluator of the subsets. In order to find


an optimal subset of features, heuristic search methods are often applied [Saeys et al. 2007]. Two advantages pointed out by [Saeys et al. 2007] for the wrapper approach are the fact that the method interacts with the classification algorithm when selecting subsets and the fact that it can work with the features' dependencies. However, the same authors indicate that this approach is computationally expensive and has a higher risk of overfitting. The filter approach selects subsets of features according only to the information on the features of the dataset, not taking into account the classification algorithm during the selection. As [Saeys et al. 2007] affirm, in most cases, as in this work, a score is calculated for each feature and the ones with a score below a certain value (threshold) are discarded. [Saeys et al. 2007] indicate that this approach has the advantage of being easily scalable to large datasets, and of being faster and simpler than the wrapper approach, as it is independent of the classification algorithm. Therefore, the selection is made only once, before evaluating different classifiers. However, the authors also point out that this lack of interaction with the classification algorithm can be a disadvantage, as feature dependencies are not considered.

3. RANKING-BASED METHODS (RM)

In this work we used six different ranking-based methods and we also proposed three further ranked lists as combinations of these six ranking methods, considering their features' mean, median and weighted mean ranks. Next, we detail these methods.

3.1 Chi-Squared

This method calculates the chi-squared (χ²) statistic to evaluate, individually, how related a feature is to the class variable of a data set. Furthermore, [O'Brien and Vogel 2003] declare that the χ² method tests the validity of the null hypothesis regarding whether the differences between two sets of data are coincidental or not. The rejection of the null hypothesis indicates that the two sets are different due to causality. [Liu et al. 2002] state that a feature's importance increases along with its calculated χ² value. According to [Liu and Setiono 1995], the Chi-Squared method is capable of working with multiclass datasets and stands as a good method for feature selection. Moreover, they affirm this method can also be used in datasets with mixed types of attributes (i.e., numeric or discrete data) because, differently from other feature selection algorithms, it is capable of discretising numeric attributes whilst selecting features. Nevertheless, [O'Brien and Vogel 2003] point out that there is a bigger chance of rejecting the null hypothesis the larger the dataset becomes.

3.2 Information Gain

The Information Gain (IG) method measures the number of bits of information obtained by analysing the presence or absence of a feature in the dataset. IG is, according to [Hall and Holmes 2003], one of the simplest and fastest ranking methods. However, [Zheng et al. 2004] point out that there can be some inflated scores and that they can affect the hypothesis generalisation accuracy. [Harris Jr 2001] indicates that a categorical (or nominal) feature with many distinct values can achieve a high IG value because some of these values will occur only in a small subset of instances, where it is easier to obtain an unbalanced class distribution, in which one class has a much higher relative frequency than the other class(es). The method will recommend this feature as the most suitable for predicting the class of new instances because of its high IG score, although this recommendation is not necessarily correct, since a class's large relative frequency in some subsets of instances may not be statistically reliable, being based on a small number of instances. Therefore, the IG may produce a biased result, because it can exercise some favouritism towards features with many values.
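To make the measure concrete, the short sketch below computes the information gain of a single discrete feature with respect to the class variable; it is a didactic NumPy example with made-up data, not the WEKA implementation of these rankers.

import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels (in bits)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """IG of one discrete feature: H(class) minus the class entropy
    averaged over the feature's values."""
    total = entropy(labels)
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return total - cond

# Toy example: a binary GO-term annotation versus pro/anti-longevity labels
feature = np.array([1, 1, 1, 0, 0, 0])
labels = np.array(["pro", "pro", "anti", "anti", "anti", "anti"])
print(information_gain(feature, labels))  # about 0.46 bits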


3.3 Gain Ratio

As an addition to the IG method, the Gain Ratio (GR) proposes a correction to the previously discussed favouritism that Information Gain can generate. Instead of selecting the feature based only on its IG value, the GR method performs further calculations. Firstly, it finds the feature's split information, which is the feature's information by itself (independent of the class variable), and later it divides the previously calculated IG by this split information, therefore composing the GR value [Harris Jr 2001].

3.4 ReliefF

Before discussing the ReliefF method, it is important to introduce its precursor, the Relief method. Relief is an instance-based ranking method for problems with two classes which iteratively and randomly samples an instance from the dataset to find its nearest neighbour belonging to the same class (near-hit) as the sampled instance, as well as the nearest neighbour belonging to the opposite class (near-miss), using the Euclidean distance. ReliefF acts as an extension of Relief: it works in the same way as the latter, but it is designed to handle noisy data sets and multi-class problems. Instead of analysing a single neighbour for each class of the sampled instance, ReliefF takes into account the k nearest neighbours, with respect to the same conditions applied in Relief. [Hall and Holmes 2003] indicate that the utility of a feature can be determined by how much its values in the set of instances belonging to a certain class differ from the ones in the sets of instances belonging to a different class, and by how much the feature's values show equality for instances of the same class. Although [Hall and Holmes 2003] also state that ReliefF's reliability increases with the size of the dataset, [Kannan and Ramaraj 2010] warn that this method can be computationally expensive if the dataset is very large.

3.5 Symmetrical Uncertainty

Symmetrical Uncertainty (SU) is a correlation-based measure that relies on the level of uncertainty, or entropy, of a certain variable within a class. The IG is calculated to measure the correlation between features, since it has a symmetrical behaviour for two variables. The SU stands as a mechanism to compensate for the IG's favouritism towards features with many values and, according to [Kannan and Ramaraj 2010], a feature is considered good when it has a high correlation with the class but not with any other feature.

3.6 OneR

The One-Rule (OneR) is a classifier that generates a set of different classification rules for each value of an attribute and then calculates the error rate of this set of rules. As [Muda et al. ] and [Buddhinath and Derry 2006] point out, the rules are determined according to the frequency with which the desired class appears. Although the attribute with the lowest error rate is usually chosen as the one to predict the dataset's class, in WEKA the error rate is used to build a ranked list in which the attributes with the lowest error rates are considered better, therefore belonging to the top of the list. [Buddhinath and Derry 2006] state that, despite being a simple approach, the OneR classifier is able to adapt to a dataset with missing values or numeric attributes and it produces rules that are easier for humans to interpret compared to state-of-the-art learning algorithms. However, [Buddhinath and Derry 2006] indicate that the accuracy of these rules is slightly inferior to the ones produced by the state-of-the-art learning algorithms.


3.7 Combined Ranking Methods

Besides the six ranking-based methods described above, three further methods are used. These methods, namely the Arithmetic Average Ranking (AAR), Weighted Average Ranking (WAR) and Median Ranking (MR), are combinations of the Chi2, IG, GR, ReliefF, SU and OneR ranked lists, obtained by taking the arithmetic average, the weighted average and the median of each feature's ranks, respectively. As the name indicates, for each dataset (model organism), AAR is built by calculating the rank of each GO term as the arithmetic mean of the ranks obtained from the six ranking-based methods. Similarly, WAR and MR are built by calculating the rank of each GO term as the weighted average and the median, respectively.

In order to determine the weights for the weighted mean in the WAR method, it is first necessary to calculate the Kendall correlation coefficient of each ranking-based method. The Kendall coefficient, as presented in [Abdi 2007], measures how similar two sets of ranks over the same set of objects are; with regard to this work, it analyses how the rank values of one ranking-based method relate to the rank values of another method (e.g., how the ranks produced by Chi-Squared relate to the ranks produced by IG, GR, and so on) for the same dataset, by counting the numbers of concordant and discordant rank positions. The weight W(R_k) of a ranking method k is obtained by subtracting from 1 the mean of the absolute values of the method's correlation coefficients (Mean[|R_k|]), where the mean is computed over the coefficients between k and each of the other ranking methods. The absolute value is used because the coefficients can assume any value in the interval [-1, 1], where -1 indicates total negative correlation (two methods produce completely opposite rankings of features), 0 indicates no correlation at all, and 1 indicates total correlation (two methods produce exactly the same ranking of features).
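As a rough illustration of the WAR weighting just described, the sketch below derives the weights from pairwise Kendall coefficients and combines the rankings; the rank matrix is a random stand-in, not data from this work, and the code is only a sketch of the scheme, not the authors' implementation.

import numpy as np
from scipy.stats import kendalltau

methods = ["Chi2", "IG", "GR", "ReliefF", "SU", "OneR"]
rng = np.random.default_rng(0)
# ranks[m][f] = rank of feature f according to method m (toy values)
ranks = {m: rng.permutation(50) + 1 for m in methods}

weights = {}
for m in methods:
    taus = []
    for o in methods:
        if o == m:
            continue
        tau, _ = kendalltau(ranks[m], ranks[o])
        taus.append(abs(tau))
    weights[m] = 1.0 - np.mean(taus)       # W(R_k) = 1 - Mean[|tau_k|]

# WAR rank of each feature: weighted average of the six ranks
war = sum(weights[m] * ranks[m] for m in methods) / sum(weights.values())
print(weights)
print(war[:5])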

4. RANKING AND HIERARCHICAL ANALYSIS SYSTEM (RHAS)

In this work, each instance represents a gene and each feature is a binary variable indicating whether or not the gene is annotated with a given Gene Ontology (GO) term. In the GO database, the terms are arranged in a hierarchical Directed Acyclic Graph (DAG) structure. Each node of the DAG represents a GO term, and a node can be connected to any number of other nodes (parents or children). The deeper the node, the less generic the term it represents. Note that, due to the nature of the GO term hierarchy, if a given gene is annotated with a given GO term G, the gene is also annotated with all ancestors of G in the GO term hierarchy. This is a form of hierarchical redundancy, as discussed in [Wan et al. 2015].

Consider the case where a term X and its ancestor Y are both relevant to the class of the dataset, that is, their ranking values are above a specific threshold. If X is placed above Y in the ranked list (i.e., X has a better ranking value than Y), there is hierarchical redundancy in the list of features, because X carries information about itself and all of its ancestors. Therefore, Y can be discarded, since X provides additional information beyond what Y provides.

We propose a filter feature selection technique that considers both the descendants and the position of a GO term in a ranked list. For each ranked list generated with the RMs described in Section 3, the algorithm iterates over the list, from top (best feature) to bottom (worst feature), and checks whether there is any descendant of a given term that is ranked better than the term itself. If a descendant of the term is ranked better, the term is removed from the ranked list.
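A minimal sketch of the redundancy-removal pass just described; the descendant map and the GO identifiers are illustrative inputs under assumed semantics, not the authors' implementation.

def remove_hierarchically_redundant(ranked_terms, descendants):
    # ranked_terms: GO terms from best to worst rank
    # descendants: term -> set of its descendant terms in the GO DAG (assumed precomputed)
    kept, earlier = [], set()
    for term in ranked_terms:
        if descendants.get(term, set()) & earlier:
            earlier.add(term)          # a descendant is ranked better: drop this term
            continue
        kept.append(term)
        earlier.add(term)
    return kept

ranked = ["GO:0006915", "GO:0008219", "GO:0016265"]              # toy ranking, most specific first
descendants = {"GO:0008219": {"GO:0006915"},
               "GO:0016265": {"GO:0008219", "GO:0006915"}}
print(remove_hierarchically_redundant(ranked, descendants))      # keeps only the best-ranked, most specific term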

5. EXPERIMENTS AND RESULTS

The experiments were based on the resulting lists obtained with the ranking and hierarchical analysis system. Each list is used to compose a new dataset (with fewer features than the original one); these datasets differ from each other in the ranking feature selection method applied and in the GO terms that the ranking and hierarchical analysis system chose to remove. That is, considering an arbitrary dataset A containing 100 features, a list A1 is ranked using, say, the Chi2 method and another list


A2 is ranked using, say, the IG method. At this point, both lists have the same 100 features of A, but after applying the RHAS this number may be reduced. After the RHAS, the feature lists A1 and A2 are used to build different datasets, say X and Y, respectively, containing the same number of instances as A but fewer features. This is done for each of the 9 RFSMs on the four model organisms' datasets.

We noticed that, after applying the RHAS, all datasets had their number of features reduced with respect to the original dataset. However, simply reducing the number of features is not a guarantee that one dataset is better than another. To measure whether the RHAS was able to remove redundant features without compromising the original dataset, we submitted the results to two different classifiers available in WEKA: K-Nearest Neighbour (KNN), using the Jaccard distance instead of the Euclidean distance, and Naïve Bayes. All 9 datasets of each model organism were submitted to both the KNN and Naïve Bayes classifiers using 10-fold cross-validation. Each run of the cross-validation measures the sensitivity and the specificity, taking the pro-longevity class as the positive class of the confusion matrix. These two values are then used at the end of the classification procedure to calculate the geometric mean obtained by a classifier on a given dataset. Table I and Table II show the geometric means obtained, respectively, with the KNN and Naïve Bayes classifiers on all resulting datasets of the four model organisms considered in this work, along with their respective positions in parentheses (where 1 is the best and the lowest value is the worst) for a given dataset.

Table I. Geometric mean of the KNN classifier (the value in parentheses is the method's rank within the dataset; 1 is best).

| Dataset | Std. | GR | SU | WAR | ReliefF | IG | MR | Chi2 | OneR | AAR |
| D. melanogaster | 61.49 (3) | 58.09 (7) | 58.09 (7) | 62.04 (2) | 58.63 (5) | 58.09 (7) | 62.59 (1) | 58.41 (6) | 59.52 (4) | 61.49 (3) |
| Mus musculus | 67.13 (5) | 64.76 (8) | 68.00 (3) | 69.93 (2) | 65.61 (7) | 67.45 (4) | 70.48 (1) | 63.14 (10) | 63.50 (9) | 66.44 (6) |
| C. elegans | 54.94 (6) | 57.36 (2) | 53.57 (8) | 55.31 (5) | 52.18 (9) | 52.89 (10) | 56.75 (3) | 57.70 (1) | 54.48 (7) | 55.87 (4) |
| S. cerevisiae | 53.31 (8) | 58.05 (3) | 57.90 (4) | 55.47 (7) | 51.36 (9) | 56.08 (5) | 59.64 (1) | 59.47 (2) | 55.48 (6) | 55.47 (7) |
| Avg. rank | 5.5 | 5 | 5.5 | 4 | 7.5 | 6.5 | 1.5 | 4.75 | 6.5 | 5 |

Table II. Geometric mean of the Naïve Bayes classifier (the value in parentheses is the method's rank within the dataset; 1 is best).

| Dataset | Std. | GR | SU | WAR | ReliefF | IG | MR | Chi2 | OneR | AAR |
| D. melanogaster | 61.57 (4) | 59.90 (7) | 60.62 (6) | 66.84 (1) | 57.59 (9) | 61.10 (5) | 63.20 (3) | 59.44 (8) | 57.15 (10) | 63.68 (2) |
| Mus musculus | 63.42 (4) | 59.20 (7) | 58.71 (8) | 65.99 (3) | 61.70 (5) | 57.65 (9) | 66.73 (1) | 59.20 (7) | 60.60 (6) | 66.08 (2) |
| C. elegans | 56.80 (8) | 58.17 (4) | 55.19 (10) | 61.35 (1) | 57.98 (6) | 56.31 (9) | 61.32 (2) | 57.42 (7) | 58.01 (5) | 61.20 (3) |
| S. cerevisiae | 59.74 (3) | 58.12 (7) | 58.46 (5) | 61.82 (1) | 56.29 (9) | 56.44 (8) | 60.25 (2) | 58.45 (6) | 58.63 (4) | 61.82 (1) |
| Avg. rank | 4.75 | 6.25 | 7.25 | 1.5 | 7.25 | 7.75 | 2 | 7 | 6.25 | 2 |

In order to validate the approaches proposed here, the combined RMs and the RHAS, it was necessary to statistically evaluate the geometric means obtained with the classifiers. This was first done by applying the Friedman test to the values of Table I and Table II and afterwards using Holm's post-hoc procedure with α = 5%. The Friedman test's null hypothesis states that all algorithms are equivalent and, therefore, that their ranks, calculated from the test, are the same [Demšar 2006]. Rejection of this null hypothesis allows the analysis to continue with Holm's post-hoc test [Demšar 2006], which compares the algorithms pairwise under the null hypothesis that there are no significant differences between them [Garcia and Herrera 2008].
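For illustration, a procedure of this kind can be reproduced along the following lines with SciPy; the pairwise Wilcoxon test is only a stand-in for the pairwise comparison corrected by Holm's procedure, and the score matrix reuses a few columns of Table I as toy input.

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

scores = {                       # rows: the four datasets, one entry per method (from Table I)
    "MR":  [62.59, 70.48, 56.75, 59.64],
    "WAR": [62.04, 69.93, 55.31, 55.47],
    "Std": [61.49, 67.13, 54.94, 53.31],
}
stat, p = friedmanchisquare(*scores.values())
print("Friedman p-value:", p)

# Holm's step-down procedure over the pairwise p-values
pairs = [("MR", "WAR"), ("MR", "Std"), ("WAR", "Std")]
pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
order = np.argsort(pvals)
alpha, m, stop = 0.05, len(pvals), False
for rank, i in enumerate(order):
    reject = (not stop) and pvals[i] <= alpha / (m - rank)
    stop = stop or not reject            # once a test is not rejected, stop rejecting
    print(pairs[i], round(pvals[i], 4), "reject" if reject else "keep")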


Concerning the KNN classifier, the Friedman test rejected its null hypothesis and ranked MR as the best algorithm, followed by WAR, Chi2 and AAR, respectively, whilst ReliefF was ranked the worst. This ranking is measured according to the average position of an RM's geometric mean over all datasets; i.e., if an RFSM has the highest geometric mean in three datasets and the third highest in one dataset, its average rank is 1.5. The Holm post-hoc test showed that, whilst the MR and WAR methods indeed present significant differences with respect to all other RMs, Chi2, although well ranked, did not show a significant difference from two worse-ranked methods: GR and AAR. The Friedman test also rejected the null hypothesis on the results of the Naïve Bayes classifier. In this case, WAR was the best-ranked algorithm, followed respectively by AAR, MR and the standard dataset, whilst IG was the worst. According to the Holm post-hoc results, the WAR method is significantly different from all other methods, and AAR, although slightly better ranked, did not show a significant difference from the MR method.

With regard to the combined RMs, we can notice that the WAR method was among the top three ranked algorithms for both classifiers, as the second best in KNN and the best in Naïve Bayes. Despite also appearing in the top three (the best in KNN and the third best in Naïve Bayes), MR's average position is worse than WAR's. Finally, AAR appears in the top four, occupying the fourth and second positions for KNN and Naïve Bayes respectively, and is the worst of the combined RMs. Generally speaking, the datasets resulting from WAR, MR and AAR performed better with both classifiers than the standard dataset (with no filter selection). On the other hand, with regard to the six raw RMs, only the Chi2 and GR datasets performed better than the standard dataset with KNN, and none of them outperformed the standard dataset with Naïve Bayes.

We had high expectations for the WAR method, since the use of the Kendall coefficients should smooth the discrepancy of rank positions among the raw RMs and therefore give more importance to the results of weakly correlated methods. The results demonstrated that the WAR method was indeed able to meet our expectations. We infer that its superior performance among the RMs is due to the use of the Kendall coefficients and, therefore, to a more balanced combination of the ranked positions.

6. CONCLUSION

In this work we proposed a filter feature selection method that considers both the hierarchical information and the ranking positions of Gene Ontology terms (features) in datasets of four model organisms (Drosophila melanogaster, Mus musculus, Caenorhabditis elegans and Saccharomyces cerevisiae). The ranking positions are determined by nine different ranking-based methods, three of which are combinations of the other six. The feature selection occurs after generating all nine ranked lists from the standard dataset of each model organism. For each ranked list generated with the RMs described before, the algorithm iterates over the list, from top (best feature) to bottom (worst feature), and checks whether there is any descendant of a given term that is ranked better than the term itself. If a descendant of the term is ranked better, the term is removed from the ranked list.

All datasets had their number of features reduced with respect to the original dataset after the feature selection. However, it was necessary to apply them to a classifier to analyse whether this reduction at least maintained the same level of predictive performance as the standard dataset. The performance of this classification was measured by calculating the geometric mean of sensitivity and specificity for each dataset. In order to analyse the geometric means we used two statistical methods: the Friedman test and Holm's post-hoc procedure. The former tests the null hypothesis that all algorithms are equivalent, whilst the latter is used only if the first test rejects its null hypothesis and tests whether there are significant differences in a pairwise comparison of the algorithms.

The results demonstrated that the combined RMs performed better than the standard and the other


RM datasets with both the KNN and Naïve Bayes classifiers. On the other hand, the standard dataset had better performance than all raw RMs with Naïve Bayes and was only worse than Chi2 and GR in the KNN classification.

We conclude that the redundancy analysis method can successfully reduce the number of features of the nine ranking-based methods' datasets. However, as the feature selection method considers the feature's position in the ranked list, and this position varies according to the ranking method being applied, some important features can be wrongly removed and, therefore, reduce the dataset's performance. An attempt to improve this selection is to combine the ranked lists, considering each feature's arithmetic mean, weighted mean and median positions, into three other lists. This attempt proved to be successful, especially when the redundancy analysis method is used over the weighted mean ranked list, which showed the highest average geometric mean values.

REFERENCES

Abdi, H. The Kendall rank correlation coefficient. In Encyclopedia of Measurement and Statistics. SAGE Publications, Thousand Oaks, CA, pp. 508–510, 2007.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25 (1): 25–29, 2000.
Buddhinath, G. and Derry, D. A simple enhancement to One Rule classification. Department of Computer Science & Software Engineering, University of Melbourne, Australia, 2006.
Consortium, G. O. et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32 (suppl 1): D258–D261, 2004.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (Jan): 1–30, 2006.
Garcia, S. and Herrera, F. An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research 9 (Dec): 2677–2694, 2008.
Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research 3 (Mar): 1157–1182, 2003.
Hall, M. A. and Holmes, G. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 15 (6): 1437–1447, 2003.
Harris Jr., E. Information gain versus gain ratio: a study of split method. Available online: http://www.mitre.org/work/tech papers/, vol. 1, 2001.
Kannan, S. S. and Ramaraj, N. A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowledge-Based Systems 23 (6): 580–585, 2010.
Kohavi, R. and John, G. H. Wrappers for feature subset selection. Artificial Intelligence 97 (1-2): 273–324, 1997.
Liu, H., Li, J., and Wong, L. A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Informatics 13: 51–60, 2002.
Liu, H. and Motoda, H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Vol. 453. Springer Science & Business Media, 1998.
Liu, H. and Setiono, R. Chi2: feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence. IEEE, pp. 388–391, 1995.
Muda, Z., Yassin, W., Sulaiman, M., and Udzir, N. Intrusion detection based on k-means clustering and naïve Bayes classification. In International Conference on Information Technology in Asia (CITA 11), pp. 1–6, 2011.
O'Brien, C. and Vogel, C. Spam filters: Bayes vs. chi-squared; letters vs. words. In Proceedings of the 1st International Symposium on Information and Communication Technologies. Trinity College Dublin, pp. 291–296, 2003.
Saeys, Y., Inza, I., and Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 23 (19): 2507–2517, 2007.
Tacutu, R., Craig, T., Budovsky, A., Wuttke, D., Lehmann, G., Taranukha, D., Costa, J., Fraifeld, V. E., and de Magalhães, J. P. Human ageing genomic resources: integrated databases and tools for the biology and genetics of ageing. Nucleic Acids Research 41 (D1): D1027–D1033, 2012.
Wan, C., Freitas, A. A., and de Magalhães, J. P. Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 12 (2): 262–275, 2015.
Zheng, Z., Wu, X., and Srihari, R. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter 6 (1): 80–89, 2004.


SOUTH-N: a method for semi-supervised outlier detection in high-dimensional data

Lucimeire A. Silva, Maria Camila Nardini Barioni, Humberto L. Razente

Universidade Federal de Uberlândia

[email protected] [email protected] [email protected]

Abstract.

With the ever-growing amount of stored data, the data mining area has become essential for manipulating and extracting knowledge from these data. Most works in this area focus on finding patterns in the data. However, data that fall outside the patterns (anomalies) can also add a great deal to the knowledge about the dataset under study. The study, development and improvement of outlier detection techniques are important goals and have proven useful in several scenarios, such as fraud detection, intrusion detection and the monitoring of medical conditions, among others. The work presented here describes a new method for the semi-supervised detection of outliers in high-dimensional data. The experiments performed with several real datasets indicate the superiority of the proposed method over the literature methods selected as baselines.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications; I.2.6 [Artificial Intelligence]: Learning

Keywords: data mining, high-dimensional data analysis, hubness, semi-supervised detection of outliers

1. INTRODUCTION

Data Mining is a multidisciplinary research area that integrates concepts and techniques from several other areas, such as Databases, Machine Learning and Artificial Intelligence. The techniques developed in this area have been used to analyse large volumes of data with the goal of finding correlations or patterns that represent some information or knowledge. Existing Data Mining techniques are classified, according to the type of knowledge to be extracted, into clustering, association, outlier detection, regression, classification, and so on [Aggarwal 2015]. The main focus of the work presented here is to contribute to the outlier detection task by proposing an algorithm based on a semi-supervised strategy for high-dimensional data.

Outliers may arise due to human errors, instrumental errors, changes in behaviour or system failures, as well as natural deviations in populations. In technical terms, an outlier is an instance that deviates from a pattern observed in the dataset [Aggarwal and Yu 2001]. The scientific literature provides several approaches for the outlier detection task. These approaches can be classified into three categories according to the type of learning employed: supervised, semi-supervised and unsupervised [Chandola et al. 2009]. Among them, the techniques based on the unsupervised and semi-supervised approaches have shown to be the most relevant [Daneshpazhouh and Sami 2014].

Currently, the datasets available for analysis have grown not only in the number of instances but also in the number of dimensions.

This work was carried out with financial support from CAPES.
Copyright © 2017. Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação.


Although there are works that address this issue, handling datasets composed of instances described by a high number of dimensions is still a challenge in the data mining area, mainly due to the problem known as the curse of dimensionality [Samet 2005]. Recent works have successfully employed a characteristic of high-dimensional data, known as hubness, to propose methods for several data mining tasks [Radovanovic et al. 2010], [Tomasev et al. 2014] and [Heylen et al. 2017]. Hubness measures the tendency of some instances, called hubs, to occur more frequently in the k-nearest-neighbour lists of other instances. The strategy employed in the development of the SOUTH-N (Semi-supervised OUTlier detection based on Hubness Neighborhood) method proposed in this paper consists of adopting a semi-supervised approach based only on a few previously labelled positive samples (outliers) given as input to the proposed algorithm. These samples are used to obtain negative samples (inliers) by means of a strategy that evaluates the neighbourhood of the samples based on density-estimation information. This approach was motivated by the fact that a specialist can label out-of-pattern behaviour more easily and more quickly than all the possible variations of normal behaviour in a given scenario under study [Daneshpazhouh and Sami 2015]. The experimental results obtained show the good efficacy of SOUTH-N when applied to real datasets of different sizes and dimensions.

The remainder of the paper is organized as follows. Section 2 presents the related work and the main concepts needed to understand the proposed method. SOUTH-N is described in Section 3. In Section 4, the proposed method is compared with the state of the art using different datasets in order to demonstrate its efficacy. Finally, Section 5 presents the conclusions of the paper.

2. FUNDAMENTAL CONCEPTS

The concepts related to the strategies employed in the development of the SOUTH-N method concern outlier detection approaches and strategies for dealing with high-dimensional data. Section 2.1 discusses outlier detection and Section 2.2 describes the hubness aspect.

2.1 Outlier Detection

The study, development and improvement of outlier detection techniques are important goals and have proven useful in different areas, such as: the identification of credit card fraud [Agrawal et al. 2015]; the inspection of medical exams to find anomalies that represent diseases, such as cancer [Gaspar et al. 2011]; stock market analysis [Song and Cao 2012]; the management of catastrophes and environmental events [Zheng et al. 2010]; or even image processing, for example, to characterize surveillance failures by detecting crowd anomalies in real time [Guler et al. 2013].

The term outlier, also known as anomaly, is used to indicate instances that deviate from the other instances of a dataset. According to [Hawkins 1980], 'an outlier is an instance (or a subset of instances) that appears to be inconsistent when compared with the remainder of the dataset'. In outlier detection, all data in the set under analysis are classified as inliers or outliers. Outliers are the discrepant or abnormal data and inliers are the normal data. Several techniques for outlier detection are described in the literature. These techniques can be categorized into three classes according to the type of learning employed: supervised [Gogoi and Borah 2011], unsupervised [Lu et al. 2017] and semi-supervised [El-Kilany et al. 2016].

The logic employed by these techniques for outlier detection can also vary. There are depth-based methods, in which outliers are represented by the instances with shallow depth [Montes 2014]. Other methods are grounded on distance measures, in


which the most distant instances are possible outlier candidates [Hassanzadeh et al. 2012]. There are also methods that seek the instances that deviate from the others according to the characteristics inspected in the study [Breunig et al. 2000]. Finally, there are methods that adopt clustering-based approaches and define outliers as the instances isolated from the other groups, or as groups that are excessively small with respect to the others [Jiang and Yang 2009].

Supervised outlier detection techniques use a dataset previously labelled by a domain specialist to train a model that is later used to label the dataset under study. In unsupervised outlier detection, no prior knowledge about the data is provided and, therefore, the methods have no information allowing them to distinguish outliers from inliers. These methods assume that most of the data are normal and try to use strategies that track the data that differ the most. Semi-supervised outlier detection techniques, in turn, use knowledge about some previously labelled instances to find the outliers. Despite the practicality of unsupervised techniques, they present high false alarm rates and low outlier detection rates [Xue et al. 2010]. Supervised techniques, on the other hand, have the disadvantage of requiring a large amount of labelled data for training. These techniques yield good results; however, they demand a great deal of human effort, which makes them more susceptible to erroneous results due to an incorrect training model. Moreover, providing an example for every possible case of deviation from the pattern of the dataset under analysis is a difficult task [Chandola et al. 2009]. For these reasons, and because it provides good results in different contexts, the semi-supervised approach has shown promise and is currently being studied more intensively.

There are different techniques in the literature that employ the semi-supervised approach for outlier detection. They can be divided into three categories according to the input of the algorithm. The first category employs examples of outliers and inliers as semi-supervision information [Xue et al. 2010]. The second category uses only outlier samples as examples; in this category, the outlier examples are called positive samples and the inlier examples are called negative samples [Daneshpazhouh and Sami 2015]. The third category relies only on data labelled as inliers as input for outlier detection [Blanchard et al. 2010]. Considering that outliers are rare and hard to label precisely and that, in many cases, a set of inliers is not available because labelling negative instances is costly, the second category of techniques is more viable than the others, since it demands less human effort. The second phase of the SOUTH-N method employs a technique from this category.

2.2 The Hubness Aspect

Despite the constant advances in outlier detection research, this task remains a challenge due to the curse of dimensionality [Samet 2005]. Nevertheless, high dimensionality produces effects that can be exploited in favour not only of outlier detection but also of other data mining tasks that suffer from this curse. One factor that has recently been studied and used with success is called the hubness aspect [Flexer 2016].

The hubness aspect is the tendency of high-dimensional data to contain instances (called hubs) that occur frequently in the k-nearest-neighbour lists of other instances. As the intrinsic dimensionality of the data increases, the distribution of the k-occurrences in the nearest-neighbour lists of the data instances becomes skewed and shows higher variance. According to [Tomašev and Buza 2015], the hubness score of an instance can be calculated as follows. Considering a set of data instances X = {x_1, ..., x_n} ⊂ R^d, N_k(x) denotes the number of k-occurrences of an instance x ∈ X, that is, the number of times x appears in the k-nearest-neighbour lists of other instances of X. Thus, the data instances (called hubs) that appear frequently in the k-nearest-neighbour lists of other instances are a good indication of denser regions.
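As an illustration of the k-occurrence score N_k(x) defined above, the following sketch counts, for each instance, how often it appears in the k-nearest-neighbour lists of the other instances; the data and the value of k are toy assumptions, not the authors' setup.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrence_scores(X, k=5):
    # k+1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    counts = np.zeros(len(X), dtype=int)
    for neighbours in idx:
        for j in neighbours[1:]:          # skip the point itself
            counts[j] += 1                # j appears in one more k-NN list
    return counts                         # counts[x] = N_k(x)

X = np.random.rand(200, 30)               # toy high-dimensional data
print(k_occurrence_scores(X, k=5)[:10])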


3. THE SOUTH-N METHOD

The SOUTH-N (Semi-supervised OUTlier detection based on Hubness Neighborhood) method provides a new approach to outlier detection that combines strategies of semi-supervision, of density estimation based on hubness scores, and of non-binary (fuzzy) clustering, with the goal of contributing to the analysis of high-dimensional datasets. The purpose of this method is to identify and label outliers from very little information about the outliers present in structured, partially classified data repositories. This approach was chosen because of the ease and practicality of identifying a few examples that stand out from the others, as opposed to the difficulty of correctly labelling all the variants of the patterns found in the data.

SOUTH-N is based on the learning from positive and unlabeled data (LPU) strategy, which proposes learning from a few positive examples, or outliers, (P) and unlabeled data (U). It is composed of two main phases, the first of which is responsible for obtaining the semi-supervision information that will be used in the second. In the first phase, the negative samples are obtained through a neighbourhood analysis strategy based on hubness information for the positive samples (outliers) given as input to the method. In the second phase, the semi-supervision information obtained in the first phase (positive and negative samples, i.e. outliers and inliers) is used to guide the outlier detection process over the dataset under analysis. These two phases are explained in more detail in Sections 3.1 and 3.2.

3.1 First Phase

In the first phase, given the dataset X = {x_1, ..., x_n}, for which a subset P = {p_1, ..., p_j} ⊂ X of positive samples (outliers) with j < n is known to exist, and a set of unlabeled samples U, U = X - P, the algorithm extracts q% of reliable negative samples (inliers). The steps performed in this phase are presented in Algorithm 1 and detailed next.

Algorithm 1: First Phase
Input: set of positive samples P, set of unlabeled samples U, number of nearest neighbours k, percentage q of negative sample examples
Output: set of negative sample examples NE
 1  begin
 2    V <- {}; NK <- {}; NE <- {}; n_j <- {}
 3    for each p_j in P do
 4      v_j <- select the k nearest neighbours of p_j in U
 5      insert v_j into V
 6    for each neighbour list v_j in V do
 7      for each neighbour i in v_j do
 8        n_i <- compute N_k(i)
 9        insert n_i into n_j
10      insert the pair (v_j, n_j) into NK
11      sort NK in decreasing order
12      insert into NE the first q% elements v_j of NK
13    return NE
14  end

j

O primeiro passo do algoritmo consiste em calcular a vizinhança v de cada amostra positiva p ∈ P informada como entrada. Para tanto são selecionados os k vizinhos mais próximos de cada amostra j

Symposium on Knowledge Discovery, Mining and Learning, KDMILE 2017.

124

j

5th KDMiLe – Proceedings

October 02-04, 2017 – Uberlˆ andia, MG, Brazil

SOUTH-N: um método para a detecção semissupervisionada de

outliers

·

em dados de alta dimensão

5

positiva p no conjunto de amostras não rotuladas U , sendo U = X − P (veja a linha 3 do Algoritmo 1). Após a seleção da vizinhança v de cada amostra positiva p , o próximo passo consiste em realizar o cálculo da pontuação (N ) (conforme explicado na Seção 2.2), para cada elemento da lista de vizinhança v e armazenar na lista N K o par (v , n ), sendo n a pontuação de v (veja a linha 6 do Algoritmo 1). É importante destacar que cada amostra positiva p possui uma lista N K com a pontuação dos elementos de sua vizinhança. Por m, para a denição da saída do algoritmo, ou seja a lista N E, as listas N K são ordenadas de forma decrescente, com relação ao valor da pontuação , e uma combinação dos q% primeiros elementos de cada lista N K de cada amostra positiva (p ) são adicionados a N E. Ou seja, os q% primeiros elementos não rotulados da vizinhança de cada amostra positiva (p ) são retornados. 3.2 Segunda Fase A segunda fase do método emprega o algoritmo ( ) proposto em [Xue et al. 2010]. O necessita que sejam informados como entrada exemplos de amostras positivas e negativas o que diculta a sua usabilidade na prática, uma vez que impõe a necessidade de supervisão para ambos tipos de amostras [Daneshpazhouh and Sami 2015]. Entretanto o método proposto auxilia a mitigar esse problema, pois necessita que sejam informadas como entrada apenas poucas amostras positivas. Os detalhes do Algoritmo são apresentados a seguir. Segundo [Xue et al. 2010], o algoritmo tem como objetivo obter uma matriz U = {u |1 6 i 6 n, i 6 k 6 c}, na qual os valores para u , variando entre 0 e 1, indicam uma associação da i-ésima instância ao k-ésimo grupo, considerando um conjunto de dados X = {x , ..., x } com as primeiras l < n instâncias rotuladas com valores zero ou um, sendo que o valor zero indica se a instância é um e o valor um indica o contrário. Para isso, se o somatório da linha da matriz for igual à zero isso signica que a instância em questão não tem chances de permanecer em nenhum agrupamento, caso contrário, a instância pertence a pelo menos um grupo. Essa matriz é denida a partir de um problema de otimização resolvido pelo conforme apresentado na Equação 1 no qual o objetivo é encontrar a minimização dos termos. No primeiro termo da Equação 1, o somatório mais interno representa o somatório das distâncias de cada elemento i do conjunto de dados em relação ao centróide do grupo, ponderado sobre u . A minimização desse termo garante um melhor agrupamento, pois ele representa o erro. O segundo termo é responsável por garantir que o resultado não possua um número muito alto de , sendo o parâmetro γ responsável por ponderar a quantidade de . Assim, para garantir a minimização, o somatório interno deve ser igual a n. Por m, o terceiro termo, ponderado pelo parâmetro γ , garante que as instâncias já rotuladas mantenham os seus rótulos durante o processo. Ou seja, se y é zero ( ) o somatório da linha da matriz também deve ser zero ou um para as instâncias previamente rotuladas como amostras positivas. Assim, é gerada uma matriz nxc (Número de amostras X ), na qual o elemento E da matriz representa o quanto a amostra i (linha) tem a probabilidade de pertencer ao j (coluna). j

j

hubness

j

k

j

j

j

hubness

j

j

j

hubness

hubness

j

j

SOUTH-N

FRSSOD

outlier detection

Fuzzy rough semi-supervised

FRSSOD

SOUTH-N

FRSSOD

FRSSOD

ik

n·c

ik

fuzzy

1

n

outlier

FRSSOD

ik

outliers

1

outliers

2

outlier

i

fuzzy

Cluster

ij

cluster

minJm (u, v) =

n X c X

(uik )m d2ik + γ1

i=1 k=1

n−

n c X X i=1

!m !

(uik )

k=1

+ γ2

l X i=1

yi −

c X

(uik )

k=1



When the algorithm finishes, from this final representation of the matrix we can extract the outliers simply by looking for the rows of the matrix whose sum is equal to zero.
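This final step can be illustrated as follows; the membership matrix below is a toy stand-in for the output of the second phase, not a result produced by FRSSOD.

import numpy as np

U = np.array([[0.7, 0.3],
              [0.0, 0.0],      # belongs to no cluster -> outlier
              [0.2, 0.8],
              [0.0, 0.0]])     # outlier
outlier_rows = np.where(U.sum(axis=1) == 0)[0]
print("outlier indices:", outlier_rows)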


Table I. Datasets considered in the experiments.

| Dataset | Instances | Classes | Dimensions | No. of inliers | No. of outliers |
| Breast Cancer Wisconsin | 699 | 2 | 10 | 458 | 241 |
| Cardiotocography - CTG | 2,126 | 10 | 10 | 1,950 | 176 |
| Ecoli | 336 | 8 | 8 | 307 | 29 |
| Forest type mapping | 326 | 4 | 27 | 289 | 37 |
| Glass Identification | 214 | 7 | 10 | 163 | 51 |
| Ionosphere | 351 | 2 | 32 | 225 | 126 |
| New Thyroid | 215 | 3 | 6 | 150 | 65 |
| Parkinsons | 195 | 2 | 23 | 147 | 48 |
| Spambase | 4,601 | 2 | 57 | 1,813 | 2,788 |
| Statlog (Vehicle Silhouettes) | 946 | 4 | 18 | 720 | 226 |
| SPECTF Heart | 267 | 2 | 44 | 212 | 55 |
| Synthetic | 500 | 3 | 3 | 485 | 15 |
| Wine | 178 | 3 | 13 | 130 | 48 |
| Zoo | 101 | 7 | 17 | 91 | 10 |


4. EXPERIMENTS

This section describes the experiments that confirm the efficacy of the SOUTH-N method on several datasets. Since SOUTH-N is a semi-supervised method, the AUC metric was used to evaluate the efficacy of the compared methods. The experiments were performed on a Dell XPS-8700 desktop with an Intel QuadCore i5-4430 [email protected] processor, 8GB of RAM and a 1TB 7200 RPM SATA-III hard disk, using the GNU gcc compiler on Microsoft Windows 8 64-bit. For these experiments, 15 datasets were selected, whose details are presented in Table I. Of these datasets, 14 are composed of real data obtained from the UCI (University of California Irvine) Machine Learning Repository [Lichman 2013] and 1 is composed of synthetic data generated by the mvrnorm function in the R language, as explained in [Ripley 2017].

Three outlier detection methods were chosen for comparison with the method presented in this paper: SSODPU, LOF and ABOD. SSODPU was chosen because it is the approach in the literature that most resembles the proposed method: it employs a semi-supervised strategy based on learning from a few positive samples and unlabeled samples, and it also uses the FRSSOD algorithm in its second phase. The other two methods, LOF (Local Outlier Factor) [Breunig et al. 2000] and ABOD (Angle-based Outlier Detection) [Kriegel et al. 2008], are based on the unsupervised approach and were selected because they are references for outlier detection in high-dimensional data.

Since both SOUTH-N and SSODPU adopt the FRSSOD algorithm in their second phases, running them requires varying the following input parameters: γ_1, γ_2 and the fuzzification exponent m. Therefore, to define the best combination of parameters for each of the algorithms, this variation was carried out for both methods as follows: γ_1 varied from 0.001 to 0.1, in increments of 0.001; γ_2 from 0.1 to 1, in increments of 0.1; and m starts at 1 and goes up to 3, in increments of 0.1. For the computation of the hubness score, the SOUTH-N method used k equal to 5. Based on the label information of the datasets, 30% of the instances labelled as outliers in each dataset were selected to be given as positive samples to the semi-supervised methods SOUTH-N and SSODPU.

The results presented in Table II were obtained considering the average of 200 runs of each algorithm. Analysing these results, it can be seen that SOUTH-N presented good results for datasets with different numbers of classes and dimensions, for example, the low-dimensional New Thyroid and the high-dimensional Spambase. In general, considering all datasets, SOUTH-N showed better efficacy than the other methods in 10 of the 14 datasets analysed.
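As an illustration of the AUC evaluation mentioned above, given ground-truth outlier labels and a per-instance outlier score, the metric can be computed as in the sketch below; the labels and scores are toy values, not results from these experiments.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])            # 1 = outlier
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.1, 0.9])
print("AUC:", roc_auc_score(y_true, scores))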



Table II. Results of the experiments, in terms of AUC. In the original, underlined values highlight the best performances.

Dataset

SOUTH-N 0,792

Breast Cancer Wisconsin, Cardiotocography-CTG, Ecoli, Forest type mapping Data Set, Glass Identification, Ionosphere, Modified Wisconsin, New Thyroid, Parkinsons, Spambase, Statlog (Vehicle Silhouettes), Synthetic, Wine, Zoo

SSODPU

0,623 0,745

0,709 0,322 0,544 0,468 0,631 0,596 0,71 0,824 0,425 0,349 0,422 0,574 0,381 0,35

-

0,166 0,107

0,5

0,726 0,611

0,703 0,638

0,82 0,963 0,774 0,592 0,5 0,65

µD (difference of the means) sD (difference of the standard deviations)

ABOD 0,618 0,535 0,520

0,739 0,432 0,568 0,499 0,492 0,639 0,580 0,434 0,551 0,559 0,495 0,141 0,155

LOF 0,522

0,621 0,652 0,604 0,586

0,642 0,807 0,669 0,583 0,512 0,486

0,723 0,516 0,711 0,071 0,118

To confirm the superiority of the SOUTH-N method over the other methods, the paired t-test was applied [Ott and Longnecker 2010]. The paired t-test compares the SOUTH-N method with each of the methods selected as baselines under the following test hypotheses: H_0 (null hypothesis): µ_D = µ_1 - µ_2 = 0, the two algorithms showed the same efficacy in the experiments performed; H_a (alternative hypothesis): µ_D > 0, the SOUTH-N algorithm showed higher efficacy than the compared algorithm in the experiments performed. To apply the paired t-test, the differences (D) between the values of each pair of AUC measurements in Table II are considered, with T_calc = (D̄ - µ_D)/(s_D/√N), where N is the number of datasets. Considering α = 0.05, the Student's t table with N - 1 = 13 degrees of freedom gives the critical value t_0.05 = 1.771. Since the T_calc values for the comparison of SOUTH-N with SSODPU (5.8144), ABOD (3.3993) and LOF (2.2623) are greater than the critical value t_0.05, and the p-values for the comparison of SOUTH-N with SSODPU (3.015e-05), ABOD (0.0023) and LOF (0.0207) are lower than 0.05, the null hypothesis H_0 can be rejected and one can conclude with 95% confidence that there is a significant difference between the results of the algorithms.

5. CONCLUSION

The work presented in this paper contributed a new semi-supervised outlier detection method suitable for high-dimensional data. Among the positive points of this method is the fact that it handles high dimensionality using the hubness concept and therefore avoids possible information loss, since it considers all the dimensions of the datasets under analysis. Moreover, its semi-supervision requires few training examples. The results of the experiments performed with different datasets help to corroborate the good performance of SOUTH-N when compared with other methods from the scientific literature.




REFERENCES

Aggarwal, C. C. Data Mining: The Textbook. Springer International Publishing Switzerland, New York, NY, USA, 2015.


Aggarwal, C. C. and Yu, P. S. Outlier detection for high dimensional data. SIGMOD Rec. 30 (2): 37–46, 2001.
Agrawal, A., Kumar, S., and Mishra, A. K. A novel approach for credit card fraud detection. IEEE/INDIACom, Washington, DC, USA, pp. 8–11, 2015.
Blanchard, G., Lee, G., and Scott, C. Semi-supervised novelty detection. Journal of Machine Learning Research 11 (99): 2973–3009, 2010.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: identifying density-based local outliers. SIGMOD Rec. 29 (2): 93–104, May, 2000.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: a survey. ACM Comput. Surv. 41 (3): 15:1–15:58, July, 2009.
Daneshpazhouh, A. and Sami, A. Entropy-based outlier detection using semi-supervised approach with few positive examples. Pattern Recognition Letters 49: 77–84, 2014.
Daneshpazhouh, A. and Sami, A. Semi-supervised outlier detection with only positive and unlabeled data based on fuzzy clustering. International Journal on Artificial Intelligence Tools 24 (03): 1550003, 2015.
El-Kilany, A., Tazi, N. E., and Ezzat, E. Semi-supervised outlier detection via bipartite graph clustering. IEEE/AICCSA, Agadir, Morocco, pp. 1–6, 2016.
Flexer, A. An empirical analysis of hubness in unsupervised distance-based outlier detection. IEEE ICDMW, pp. 716–723, Dec, 2016.
Gaspar, J., Lopes, F., and Freitas, A. An analysis of hospital coding in Portugal: detection of patterns, errors and outliers in female breast cancer episodes. CISTI, Chaves, Portugal, pp. 1–6, 2011.
Gogoi, P. and Borah, B. Supervised anomaly detection using clustering based normal behaviour modeling. International Journal of Advances in Engineering Sciences 1 (1): 12–17, Jan, 2011.
Guler, P., Temizel, A., and Temizel, T. An unsupervised method for anomaly detection from crowd videos. SIU, Haspolat, Turkey, pp. 1–4, 2013.
Hassanzadeh, R., Nayak, R., and Stebila, D. Analyzing the effectiveness of graph metrics for anomaly detection in online social networks. In WISE, X. S. Wang, I. F. Cruz, A. Delis, and G. Huang (Eds.). Vol. 7651. Springer, Cham, pp. 624–630, 2012.
Hawkins, D. M. Identification of Outliers. Chapman & Hall, United Kingdom, 1980.
Heylen, R., Parente, M., and Scheunders, P. Estimation of the intrinsic dimensionality in hyperspectral imagery via the hubness phenomenon. LVA/ICA, Grenoble, France, 2017.
Jiang, S.-Y. and Yang, A. Framework of clustering-based outlier detection. Vol. 1. FSKD, Tianjin, China, pp. 475–479, 2009.
Kriegel, H.-P., Schubert, M., and Zimek, A. Angle-based outlier detection in high-dimensional data. ACM SIGKDD, New York, USA, pp. 444–452, 2008.
Lichman, M. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2013. Accessed: May, 2017.
Lu, W., Cheng, Y., Xiao, C., Chang, S., Huang, S., Liang, B., and Huang, T. Unsupervised sequential outlier detection with deep architectures. IEEE Transactions on Image Processing 26 (9): 4321–4330, Sept, 2017.
Montes, M. C. Depth-based outlier detection algorithm. HAIS, Salamanca, Spain, pp. 122–132, 2014.
Ott, R. L. and Longnecker, M. An Introduction to Statistical Methods and Data Analysis. Cengage Learning, Boston, Massachusetts, USA, 2010.
Radovanovic, M., Nanopoulos, A., and Ivanovic, M. Hubs in space: popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 11: 2487–2531, Dec., 2010.
Ripley, B. The R Project for Statistical Computing. https://cran.r-project.org/web/packages/MASS/MASS.pdf, 2017. Accessed: May, 2017.
Samet, H. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
Song, Y. and Cao, L. Graph-based coupled behavior analysis: a case study on detecting collaborative manipulations in stock markets. IJCNN, Brisbane, Australia, pp. 1–8, 2012.
Tomasev, N., Radovanovic, M., Mladenic, D., and Ivanovic, M. The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and Data Engineering 26 (3): 739–751, March, 2014.
Tomašev, N. and Buza, K. Hubness-aware kNN classification of high-dimensional data in presence of label noise. Neurocomputing 160 (C): 157–172, July, 2015.
Xue, Z., Shang, Y., and Feng, A. Semi-supervised outlier detection based on fuzzy rough c-means clustering. Math. Comput. Simul. 80 (9): 1911–1921, May, 2010.
Zheng, L., Shen, C., Tang, L., Li, T., Luis, S., Chen, S.-C., and Hristidis, V. Using data mining techniques to address critical information exchange needs in disaster affected public-private networks. In Proceedings of the 16th KDD. KDD '10. ACM, Washington, USA, pp. 125–134, 2010.


Learning Probabilistic Relational Models: A Simplified Framework, a Case Study, and a Package

L. H. Mormille, F. G. Cozman

Universidade de São Paulo, Brazil

[email protected][email protected]

Abstract. While most statistical learning methods are designed to work with data stored in a single table, many large datasets are stored in relational database systems. Probabilistic Relational Models (PRM) extend Bayesian networks by introducing relations and individuals, thus making it possible to represent information in a relational database. However, existing methods that learn PRMs from data face some nontrivial challenges in the way relations are extracted. We propose a novel approach to learn the structure of a PRM. We also describe a package in the R language to support our learning framework, and we apply it to a real, large scale scenario combining citizens, companies and location data.

Categories and Subject Descriptors: G.3 [Probability and Statistics]: Probabilistic algorithms; I.2.6 [Artificial Intelligence]: Learning

Keywords: relational models, PRM, Bayesian networks, machine learning

1. INTRODUCTION

Most large data collections are stored in relational database systems consisting of multiple tables; however, most data mining techniques are only applicable to a single table. Multi-relational data mining (MRDM) focuses on the search for patterns across multiple tables (relations) of a database [Džeroski 2003], and many techniques have been proposed in that context. Indeed, relational models lead to a deeper understanding of the relations held within domains, and can be used for exploratory analysis, predictions and complex inferences. Despite the success of Bayesian networks in a wide variety of real-world and research applications, they cannot be used to model domains where we might encounter several entities in different configurations [Koller 1999]. Probabilistic Relational Models (PRMs) are an extension of Bayesian networks, introducing the concepts of objects and their properties, and the relations held between them, specifying a template for a probability distribution [Getoor and Taskar 2007]. Thus, PRMs offer a rich relational structure, allowing a property of an object to depend probabilistically not only on the properties of that object, but also on properties of other related objects [Getoor et al. 1999].

However, learning a PRM from relational data is a more complex task than learning a Bayesian network from "flat" data. There are three main difficulties that arise while learning a PRM. The first one is establishing what the legal dependency structures for a given domain are: we must avoid cycles in the structure. The second difficulty is how to score the possible legal structures. The third challenge is to search over the possible structures [Friedman et al. 1999]. Given the complications often faced while learning a PRM, this paper aims to: (1) propose a new method for learning probabilistic relational models; (2) apply it to a real large-scale problem; and (3) report on a software package we have built to apply our method. Taken together, these contributions

Copyright © 2017. Permission to copy without fee all or part of the material printed in KDMiLe is granted provided that the copies are not made or distributed for commercial advantage, and that notice is given that copying is by permission of the Sociedade Brasileira de Computação.


should be valuable to researchers and practitioners interested in dealing with large relational datasets. Section 2 presents a quick summary on probabilistic relational models. Section 3 explains the novel method for PRM learning. Section 4 reports on the case study and an R package produced by the authors. And Section 5 briefly presents some conclusions.

2. THE RELATIONAL MODEL FRAMEWORK

This section offers a review of PRMs, their semantics and the challenges in learning them. We use, as a running example, the case study we examine later. A relational domain is usually represented by distinct tables in a database containing attributes and entities. For instance, consider the city of Atibaia, a relatively small town in the state of São Paulo, Brazil. Suppose we are interested in information about its citizens, its companies and its census sectors (territorial units into which the city is geographically divided), and that we have three different tables, each representing one of the three classes in our domain: Person, Company and Census_Sector. As we indicate later, the goal of our case study is to learn a PRM structure and parameters with the classes in our domain, so as to predict the social class of people in Atibaia.

The vocabulary of a relational model consists of a set of classes X1, ..., Xn and a set of relations R1, ..., Rm. Every class in the domain has a set of attributes A(Xi), and every attribute Aj ∈ A(Xi) has a space of possible values V(Aj); this vocabulary defines a schema for the relational model. An attribute A of a class X is referred to as X.A. If X.A has a value that is fully determined, such as a name or an identification number, it is labeled a fixed attribute. The other attributes are probabilistic ones [Friedman et al. 1999]. The logical description of the domain is called the relational schema, and it shows how different classes relate to each other through what are called reference slots. The set of reference slots of X is denoted R(X), while X.ρ is used to refer to the reference slot ρ of X. A reference slot ρ in X can be interpreted as an attribute of X that is also a foreign key for another class. An instance I of a schema is an interpretation of the relations between the classes in the domain, and it specifies, for a set of objects, a class and a value for each of its attributes.

To continue our running example, Figure 1(a) shows the schema for our Atibaia domain. A central concept in a relational model is the relational skeleton σ, defined as a partial specification of an instance of a schema [Friedman et al. 1999]. It specifies, for a set of objects Oσ(Xi), a class, the values of the fixed attributes of these objects, and the relations held between them, leaving only the probabilistic attributes unspecified. Figure 1(b) presents an example of a relational skeleton for our Atibaia domain.

In our domain of study, the Person class has, among others, the attribute Social_class, and the value space for Person.Social_class is {A, B, C, D}. Also, the Person.Census_Sector_ID attribute is a reference slot of the Person class, with range type Census_Sector. The same applies to the class Company. Every object in Census_Sector is associated with n objects in Person, which are the citizens who live in the sector, and m objects in Company, which are the companies located inside the area comprised by the census sector.

A PRM consists of a dependency structure S and the conditional probability distribution θS associated with the dependency structure. Just like in a Bayesian network, the dependency structure of a PRM is defined by associating a set of parents Pa(X.A) with each attribute X.A.
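To make the running example concrete, the sketch below encodes a tiny relational skeleton for the Atibaia domain as flat tables, with Census_Sector_ID playing the role of a reference slot (foreign key); all attribute names and values are illustrative assumptions, and the last step anticipates the kind of mode aggregation discussed next, rather than reproducing the authors' R package.

import pandas as pd

census_sector = pd.DataFrame({"Sector_ID": [1, 2],
                              "Urban": [True, False]})
person = pd.DataFrame({"Person_ID": [10, 11, 12],
                       "Census_Sector_ID": [1, 1, 2],
                       "Social_class": ["B", "C", "D"]})
company = pd.DataFrame({"Company_ID": [100, 101, 102],
                        "Census_Sector_ID": [1, 2, 2],
                        "Size": ["small", "small", "large"]})

# An aggregated inter-class parent for Person: the mode of the sizes of the
# companies located in each person's census sector.
mode_size = company.groupby("Census_Sector_ID")["Size"].agg(lambda s: s.mode()[0])
person = person.join(mode_size.rename("Mode_company_size"), on="Census_Sector_ID")
print(person)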
A PRM consists of a dependency structure S and the conditional probability distributions θS associated with that structure. Just like in a Bayesian network, the dependency structure of a PRM is defined by associating a set of parents Pa(X.A) with each attribute X.A. However, in a PRM, an attribute X.A can have as parent either an intra-class attribute, denoted X.B, or an inter-class attribute, denoted X.τ.B, where τ is a slot chain representing the set of objects that are τ-relatives of an object x ∈ X. When an object x relates to objects of another class through a slot chain, unless the relation is guaranteed to be single-valued, x.A depends on a set of objects x.τ.B; when that occurs, a method for representing these complex dependencies is necessary. This is attained using an aggregation function.

Fig. 1. (a) The relational schema for the Atibaia domain. (b) The relational skeleton for the Atibaia domain.

Some notable aggregation functions are the mode, if the attribute is categorical; the mean, if the attribute is continuous; the median, maximum or minimum, if the attribute is ordered; or the cardinality. By using such functions, it is possible to return a summary of a multiset of values [Getoor and Taskar 2007]. An aggregation function γ takes a multiset of values and returns a summary of it, allowing X.A to have γ(X.τ.B) as a parent. For our Atibaia domain, a legal PRM structure, with the required aggregations indicated, can be seen in Figure 2. Note that, since objects from the classes Company and Person can be associated with only one object from the Census_sector class, no aggregations are required when an attribute of Census_sector is the parent of an attribute from either Company or Person.
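As a concrete illustration of γ for categorical attributes, the sketch below defines a mode aggregation in R; the function name and the example values are ours, not part of any package mentioned in this paper.

    # Mode aggregation: return the most frequent value of a categorical multiset.
    mode_agg <- function(values) {
      counts <- table(values)
      names(counts)[which.max(counts)]
    }

    mode_agg(c("A", "C", "A", "B"))   # returns "A"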

Learning a PRM from data can be quite difficult. The most imposing challenge is to learn the dependency structure. First, making sure that all the dependencies established are acyclic is a nontrivial task. The best scenario is when the skeleton σ is acyclic; however, if there is a cycle in the skeleton σ, one needs to guarantee that cycles do not happen at the object level. Evaluating and scoring competing structures is also challenging; the usual approach is to adapt Bayesian model selection. Finally, even after choosing a method to assure that a structure is legal and to score it, one must go over candidate models, comparing them. The number of possible structures may be quite large, and the cost of the search operations is also high, making it necessary to couple the search method with a heuristic approach that limits the space of possible structures, usually by associating a limited set of possible parents with each attribute.

3. THE PROPOSED METHOD

Our goal here is to search for a PRM, using some score typically used to learn Bayesian networks. As noted already, the search space is too big, so the process is quite complex and costly in practice. We propose an alternative approach to learning PRMs, by restricting the space of possible structures in ways that make sense for practical problems.

First, we do not allow the dependency structure to have cycles, not only at the attribute level, but at the class level as well. For instance, structures such as X.A → Y.B → X.C are not allowed. A second assumption is that we have attributes of interest that belong to a distinguished class, which we refer to as the main class. Note that, whenever an object x ∈ X is linked to a set of objects {y1, ..., yi} ∈ Y from another class, an attribute of the type X.A can only have as parent an attribute of the type Y.B by using an aggregation function γ. The space of possible structures is then further reduced by not allowing attributes from classes other than the main class to have as parents inter-class attributes whose slot chain is not single-valued, since that would require an aggregation. This determines that the only edges from aggregation functions are directed to the main class.

Given a relational skeleton σ, all the required aggregations for our restricted space of structures are computed in advance and stored in a table denoted the master table. At first, the master table is a copy of the main class table. Then, for every object xi in the main class X, the attributes of its τ-relative objects, whose links are given by the value of the reference slot X.ρ, are aggregated using a proper aggregation function γ, and the result is returned to its respective row in the master table. However, if the slot chain is guaranteed to be single-valued, that is, if every object xi in the main class relates to only one object of another class, the use of an aggregation function γ is unnecessary. For instance, in the Atibaia domain, the main class is Person, because Person.Social_class is the variable of interest. That means the attributes of objects in Company and Census_sector that are τ-relatives of an object in Person are aggregated. However, every object in the class Person is associated with one, and only one, object in the class Census_sector, meaning that the slot chain linking them is single-valued, so an aggregation function γ is unnecessary in that particular case. The values of attributes of Census_sector are simply replicated in the rows where their corresponding keys appear in the master table.

Now that the data is "flattened" into a master table, the challenges of defining a proper structure scoring function, choosing a search method over possible legal structures, and comparing them via score can be solved by traditional Bayesian network approaches, and there are several popular algorithms that address these issues. To make sure all adopted restrictions are enforced during learning, a black list of edges must be assembled. That list specifies which arrows cannot exist in the model, maintaining a coherent structure. Popular algorithms for Bayesian network learning typically allow for such lists of forbidden edges. For instance, in our Atibaia domain the restriction over the set of structures forbids attributes from class Person from being parents of attributes of Census_sector and Company, and also forbids attributes from class Company from being parents of any attribute of Census_sector.

Then, a Bayesian network structure, whose representation is a directed acyclic graph (DAG) describing a set of independencies [Koller and Friedman 2009], can be learned with any package of choice. For our domain of study, we used the bnlearn package; the search algorithm selected for this task was hill climbing, and the selected scoring function was the Dirichlet posterior density based on Jeffreys prior, returning a DAG analogous to a legal PRM structure. In the Atibaia domain, the resulting dependency structure shows that the only inter-class arrows respect those restrictions, thus generating a network isomorphic to a legal PRM structure for the domain, as shown in Figure 2. It is worth noting that, in this example, in the best structure learned by this method, all parents of our target variable are from foreign classes.
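The sketch below illustrates this pipeline end to end on a toy version of the Atibaia tables: flattening into a master table with a mode aggregation, assembling a black list, and running hill climbing in bnlearn. It is a minimal sketch, not the authors' implementation; the table fragments, column names and the BDe score used as a stand-in for the Jeffreys-prior Dirichlet score are assumptions of ours.

    library(bnlearn)

    # Toy relational skeleton (hypothetical values; the real tables are much larger).
    person <- data.frame(
      Person_ID        = paste0("P", 1:6),
      Census_Sector_ID = c("S1", "S1", "S1", "S2", "S2", "S2"),
      Social_class     = factor(c("A", "B", "A", "C", "D", "C"))
    )
    census_sector <- data.frame(
      Census_Sector_ID = c("S1", "S2"),
      Average_income   = factor(c("high", "low"))
    )
    company <- data.frame(
      Company_ID       = paste0("C", 1:4),
      Census_Sector_ID = c("S1", "S1", "S2", "S2"),
      Company_age      = factor(c("old", "new", "new", "new"))
    )

    # Mode aggregation for the multi-valued slot chain Person -> Census_sector -> Company.
    mode_agg <- function(values) names(which.max(table(values)))
    company_by_sector <- aggregate(Company_age ~ Census_Sector_ID,
                                   data = company, FUN = mode_agg)

    # Master table: a copy of the main class joined with the single-valued
    # census sector attributes and with the aggregated company attributes.
    master_table <- merge(person, census_sector, by = "Census_Sector_ID")
    master_table <- merge(master_table, company_by_sector, by = "Census_Sector_ID")
    master_table <- master_table[, c("Social_class", "Average_income", "Company_age")]
    master_table[] <- lapply(master_table, as.factor)

    # Black list: no arrows out of the main class, and none from Company into Census_sector.
    blacklist <- data.frame(
      from = c("Social_class", "Social_class", "Company_age"),
      to   = c("Average_income", "Company_age", "Average_income")
    )

    # Structure learning by hill climbing, then parameter fitting.
    # The paper uses a Dirichlet posterior based on Jeffreys prior; the BDe
    # score below is a common stand-in available in bnlearn.
    dag <- hc(master_table, blacklist = blacklist, score = "bde")
    fit <- bn.fit(dag, data = master_table, method = "bayes")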
The PRM, just like a Bayesian network, is associated with a conditional probability distribution (CPD), or local probabilistic model. The CPD for Xi, given its parents Pa(Xi) in the graph, is P(Xi | Pa(Xi)), and it captures the conditional probability of the random variable given its parents in the graph [Koller et al. 2007]. As described by Getoor [2001], in the same way as with Bayesian networks, the joint distribution over these assignments can be factored by taking the product, over all x.A, of the probability in the CPD of the specific value assigned by the instance to the attribute, given the values assigned to its parents. The formal expression can be written as follows:

\[ P(\mathcal{I} \mid \sigma, S, \theta_S) \;=\; \prod_{x \in \sigma} \prod_{A \in A(x)} P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{Pa(x.A)}\big) \;=\; \prod_{X_i} \prod_{A \in A(X_i)} \prod_{x \in \sigma(X_i)} P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{Pa(x.A)}\big) \]
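For instance, grouping the second product by the three classes of the Atibaia schema gives the following expansion; this is only an illustrative restatement of the formula above for this particular domain, not an additional result:

\[ P(\mathcal{I} \mid \sigma, S, \theta_S) \;=\; \prod_{x \in \sigma(\mathit{Person})} \prod_{A \in A(\mathit{Person})} P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{Pa(x.A)}\big) \; \prod_{x \in \sigma(\mathit{Company})} \prod_{A \in A(\mathit{Company})} P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{Pa(x.A)}\big) \; \prod_{x \in \sigma(\mathit{Census\_Sector})} \prod_{A \in A(\mathit{Census\_Sector})} P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{Pa(x.A)}\big) \]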


Fig. 2. The PRM structure, learned using the proposed method. The green arrows represent intra-class relations. The yellow arrows represent inter-class relations where the slot chain is single-valued. And the orange arrows represent the inter-class relations where an aggregation function was used (in our case, the mode).

4. A CASE STUDY, AND A PACKAGE FOR PRM LEARNING

In this section we apply our proposed method to a real large-scale problem. To do so, we have implemented a package in the R language. After we present some details of our case study, we describe our freely available package and apply it.

4.1 The case study

The domain of our case study is a small town named Atibaia, in the state of São Paulo, Brazil, and data was gathered from different types of objects within the city. The first class of objects is Person, and every object in that class represents a citizen of Atibaia and his or her attributes. The second class in our domain is Company, and the objects of that class are the businesses located in Atibaia. The third class represents the small territorial units that comprise the city of Atibaia, named Census_sector. Note that every object in Census_sector can have n people related to it, the citizens who live inside that sector, and m related companies, the businesses located inside that sector. However, an object in Person and an object in Company can be associated with one, and only one, census sector. The goal is to explore the relations between those classes, so as to create a model that uses their attributes to infer the social class of objects in Person.

The data for the first two classes (Person and Company) was kindly provided by Serasa Experian. All the entities in both classes are fully anonymised, and no personally identifiable attribute was provided, only a hashed key for every person or company in the tables. Another attribute, present in both these classes, is the census sector where a person lives or a company is situated. That attribute is also a foreign key for the third class in our domain, Census_Sector. The remaining attributes are probabilistic attributes, and express information about distinct aspects of a given person or company.


Fig. 3. All census sectors in Atibaia plotted on a map using the software QGIS 2.18.7 Las Palmas. The color of each sector represents the range of residents inside the sector.

The data for the Census_Sector class can be found at the website of IBGE, the Brazilian Institute of Geography and Statistics (http://www.ibge.gov.br/home/). More than a thousand different variables produced in the last census conducted in Brazil, in 2010, are available to the general public.

The Person class originally contained 110816 observations and 27 variables. However, a preliminary analysis detected that some of the variables had high collinearity, so we discarded some of them. Other variables were also dropped because they had a very high incidence of missing values. In the end, 10 variables of the class Person were selected for this study, and 8 of them were probabilistic attributes. The Company class originally had 20162 observations and 9 variables. For the same reasons as in the Person class, only 6 variables were included in the analysis, and 4 of them were probabilistic attributes. The city of Atibaia contains 327 census sectors, and a dataset with 5 variables was assembled with data available at the IBGE website. A map with the census sectors in Atibaia can be seen in Figure 3; the color of the layer represents the number of citizens living inside a given sector.
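The screening of variables with many missing values can be done with a few lines of base R. A minimal sketch follows; the data frame person_raw and the 30% threshold are hypothetical choices of ours, not values reported in the paper.

    # Drop variables with a high incidence of missing values (hypothetical threshold).
    missing_rate <- colMeans(is.na(person_raw))
    person_df    <- person_raw[, missing_rate <= 0.30]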

4.2 The package

Vanilla.prm is a package based on the method described above. The package supports domains with up to three classes with categorical attributes, and it performs all required aggregations using the mode. The package is available on GitHub, and installs like any regular R package (using devtools). Before processing, the user must ensure that all variables in the tables are categorical, and that every key and foreign key is named the same way in every table in which it appears. If any variable does not meet these criteria, the R language provides tools to manipulate the data so as to make it compliant with the requirements of the package. Once the criteria for using the package are met, the user stores the column names of all keys and foreign keys in a vector (for instance, key.names); the package then assembles the master table (master_table), learns the dependency structure (dag), and fits the conditional probability distributions (fit).
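As an example of the kind of preparation required before calling the package, consider the minimal sketch below; the data frames, the column being renamed and the key names are hypothetical, and the vanilla.prm calls themselves are not shown here.

    # Make every variable categorical and give keys the same name in every table.
    person_df$Social_class <- as.factor(person_df$Social_class)
    names(company_df)[names(company_df) == "sector"] <- "Census_Sector_ID"

    # Store the column names of all keys and foreign keys in a vector.
    key.names <- c("Person_ID", "Company_ID", "Census_Sector_ID")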
