
DOCTORAL THESIS, ENGINEERING SERIES

Feature Engineering and Data Mining for Adversarial Information Retrieval

Doctoral Program in Applied Information Technologies

Escuela Politécnica

Enrique Puertas Sanz

Supervised by: Dr. José María Gómez Hidalgo

Madrid, 2013

This thesis is a compendium of the following works, published in indexed journals and venues within the field of Artificial Intelligence:

• Puertas Sanz, E., Gómez Hidalgo, J. M., Carrero, F., Buenaga, M. (2009). Web Content Filtering. Advances in Computers, Vol. 76, pp. 257-306. Elsevier Academic Press.

• Puertas Sanz, E., Gómez Hidalgo, J. M., Cortizo Pérez, J. C. (2008). Email Spam Filtering. Advances in Computers, Vol. 74, pp. 45-114. Elsevier Academic Press.

• Gómez Hidalgo, J. M., Bringas, G. C., Puertas Sanz, E., García, F. C. (2006). Content Based SMS Spam Filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering, ACM Press, pp. 107-114.

• Gómez Hidalgo, J. M., Puertas Sanz, E., Carrero García, F., Buenaga Rodríguez, M. (2003). Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en internet. Procesamiento del Lenguaje Natural, Vol. 31, pp. 13-20.

REPORT AND AUTHORIZATION OF THE SUPERVISOR TO PRESENT THE DOCTORAL THESIS

Over the last 20 years, data mining techniques have been applied to the field of Information Technology security with growing interest and increasingly positive results. Problems such as network monitoring, anomaly and intrusion detection, and many others have been addressed by means of systems capable of learning from huge amounts of data and extracting classification rules that allow the prediction of new security events. In this setting, text mining technologies, that is, technologies for knowledge discovery in textual databases, are of particular interest, since there are numerous security problems whose fundamental information format is text. This thesis addresses three of those problems: email spam filtering, Web content filtering, and SMS spam filtering. Although these are three distinct problems, differing in the dimension of their textual components (email messages, Web pages, and short messages), the author has been able to give a unified approach to their treatment, based on the application of statistical analysis techniques to the content of the target texts. The two fundamental components of the work are the application of language engineering techniques and the development of a specific evaluation method, better suited to this kind of problem than its predecessors. The latter evaluation method has become a standard in the scientific field of security, and has been used in scientific competitions of the highest level, such as the TREC (Text REtrieval Conference) evaluations of spam filtering systems.

The thesis is backed by the work carried out in several research projects and contracts, such as the Fifth Framework Programme project POESIA (Public Open-source Environment for a Safer Internet Access), the PROFIT projects TEFILA I and II (TÉcnicas de Filtrado basadas en Ingeniería del Lenguaje, Aprendizaje automático y agentes), and the FISME research contract (Análisis prospectivo de tecnologías de filtrado de spam SMS basado en contenido). The publications that make up this doctoral thesis have been presented at conferences and in journals of the highest level, such as the ACM Symposium on Document Engineering and the series Advances in Computers, which is indexed in the JCR (Journal Citation Reports) and is the oldest series of publications related to computing. The impact of the works presented in this thesis is undeniable; for instance, the article "Content Based SMS Spam Filtering" has received 67 citations as of the date of signing, according to Google Scholar.

The researcher Dr. JOSÉ MARÍA GÓMEZ HIDALGO, Supervisor of the Doctoral Thesis "Feature Engineering and Data Mining for Adversarial Information Retrieval" ("Ingeniería de Atributos y Minería de Datos para la Recuperación de Información con Adversario"), authored by ENRIQUE PUERTAS SANZ, AUTHORIZES the presentation of the referred Thesis for its defense, in compliance with Article 21 of Royal Decree 1393/2007 of 29 October, which establishes the organization of official university education, and in accordance with the postgraduate studies regulations of the Universidad Europea de Madrid.

28 June 2013. Signed: José María Gómez Hidalgo.

ACKNOWLEDGEMENTS

Writing this doctoral thesis has been hard work that would not have been possible without the support and generosity of my friends, family, and colleagues. Thanks to José María Gómez, my thesis supervisor, and to Manuel de Buenaga, my tutor, who advised me, helped me, and pushed me through the most difficult moments. Without their good work this thesis would not have been possible. To my colleagues at the Universidad Europea de Madrid, for their patience and collaboration throughout this journey. To my parents and siblings, for their support and understanding, and for always being there when I needed them. And above all, thanks to my grandmother Felisa, for her concern every time she saw me, asking about the progress of the thesis and encouraging me to carry on with the work.

Dedicated to my parents and siblings: Juan and Conchi, Patricia, Paco, and Juan Hernando

CONTENTS

SUMMARY
ABSTRACT
I. INTRODUCTION
   Motivation
   Organization
II. TEXT CATEGORIZATION
   Text classification tasks
   The text classification process
   Feature engineering
   Evaluation
III. ADVERSARIAL TEXT CLASSIFICATION
   The problem of asymmetric and unknown costs
   Attacks on the classifier
IV. APPLICATIONS
   Application to email spam filtering
   Application to Web content filtering
   Application to SMS spam filtering
V. ARTICLES
   Article 1: Email Spam Filtering
   Article 2: Web Content Filtering
   Article 3: Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en internet
   Article 4: Content Based SMS Spam Filtering
VI. CONCLUSIONS AND SUMMARY OF CONTRIBUTIONS
VII. REFERENCES
VIII. ANNEXES
   Statements of consent from the co-authors of the articles

SUMMARY

The growing use of the Internet has brought numerous advantages, but also opportunities for fraud. A good example of this kind of abuse is found in email, a tool of undeniable value for personal communication, which nevertheless suffers from the drawback of unsolicited mail (spam). Other abuses are, for example, the downloading of inappropriate (e.g. pornographic) web pages in the workplace, or spam sent to mobile devices. Given the textual nature of the information handled in these scenarios, they have typically been addressed by means of text mining techniques, that is, knowledge discovery in textual databases. However, these kinds of abuse share a common element that makes traditional text mining tasks work incorrectly: in all of them there is an adversary who tries to degrade the effectiveness of the text categorizers generated by machine learning techniques. These are known as adversarial (text) classification or categorization tasks, in which the analysis and learning systems must bear in mind the existence of an adversary (for example, the spammer) whose objective is to degrade the effectiveness of the classification systems built with these techniques.

The two fundamental contributions of this Thesis are the application of feature engineering techniques and the development of a specific evaluation method, better suited than its predecessors to this kind of adversarial problem. The evaluation method proposed in this research has become a standard in the scientific field of security, and has been used in scientific competitions of the highest level, such as the TREC (Text REtrieval Conference) evaluations of spam filtering systems. More specifically, in this Thesis we have shown that it is possible to treat in a unified way the most sensitive step in Adversarial Text Categorization, which is the representation of the texts, using Natural Language Engineering techniques, and to carry out a homogeneous evaluation across different tasks despite their different, variable costs and the different asymmetries in their class distributions.

ABSTRACT

The boom of the Internet in recent decades has been accompanied by numerous benefits, but also by fraud and scams. A good example of this kind of abuse is found in email, a communication tool of undoubted value for people, which nevertheless suffers from the problem of unsolicited email (spam). Other abuses are, for instance, the browsing of inappropriate websites (e.g. pornography) in the workplace, or spam messages on mobile devices. Due to the textual nature of the information handled in these scenarios, they have typically been addressed using text mining techniques, i.e., knowledge discovery in textual databases. However, such abuses share a common element that makes traditional text mining tasks work improperly: in all of them there is an adversary who tries to degrade the effectiveness of the text categorizers generated by machine learning techniques. These cases are referred to as Adversarial (Text) Classification or Categorization, where learning systems should be aware of the existence of an adversary (e.g. the spammer) whose objective is to degrade the effectiveness of the classification systems built with these techniques.

This thesis makes two core contributions: the application of Language Engineering techniques, and the proposal of a specific evaluation method that is better suited to this type of adversarial problem. The proposed evaluation method has become a standard in the scientific field of security, and has been used in research competitions of the highest level, such as the TREC (Text REtrieval Conference) evaluations of spam filtering systems.

More specifically, in this thesis we have shown that it is possible to handle, in a unified way, the most sensitive task in Adversarial Text Categorization, namely the representation of texts, using techniques from Natural Language Engineering, and to perform a homogeneous evaluation across various tasks despite their variable and unknown error costs and the asymmetric distribution of their classes.

I. INTRODUCTION

Motivation

For some decades now, and especially since the popularization of the Internet, the need to provide people with tools that help them search, organize, and understand the vast amounts of information placed within their reach has become more and more evident. Search engines are paradigmatic tools in this respect, since they allow Internet users to locate information distributed all over the world in tenths of a second. More generally, Internet communication has become a multi-million-dollar business that has transformed small companies into corporate giants of the stature of Google, Amazon, or Yahoo!. But as in any other environment, where there is business, there is fraud and abuse.

The first environment to see unprecedented levels of abuse was email. Nowadays it is almost unthinkable for companies not to have email to communicate with their customers and suppliers, or among their own employees. It is likewise an enormously valuable medium for personal and private communication, and even for dealing with increasingly electronic public administrations. The practically zero cost of sending email for the end user, together with the weakness of the underlying protocol, has made it easy prey for swindlers and criminal organizations, who produce junk mail or spam for fraudulent purposes. The most recent statistics yield chilling figures: more than 80% of the email circulating on the Internet is spam, including attempts at bank fraud, chain letters, promotion of products of dubious quality (pharmaceuticals, replicas, etc.), virus delivery, and much more.

Despite the number of existing filters and solutions, spam is currently an unsolved problem, for multiple technical and legislative reasons. Nevertheless, among the most effective measures against it, Bayesian filters stand out [Puertas08]. This family of filters, which more generally should be called "learning-based" filters, has forced spammers to deploy specific and very elaborate tactics to evade them, which proves their effectiveness. These tools owe their success to their ability to analyze the content of messages and learn to distinguish the properties that separate spam from legitimate mail. Essentially, they are learning-based text classification systems that approach filtering as a text categorization task [Sebastiani02]. This task consists of assigning documents (in this case, email messages) to predefined classes (spam vs. legitimate mail) according to their content.

Text categorization is a generic task with multiple applications, such as the cataloguing of documents in bibliographic databases and (digital) libraries, the classification of Web pages, and many more. Categorization can be performed manually (as expert library cataloguers do) or automatically. The most effective technique for building automatic filters is Machine Learning applied to the text of the documents. In this way, it is possible to build a categorization system or "classifier" automatically from a collection of already classified documents. It is likewise possible to make the classifier learn from its own mistakes, which has given rise to highly personalized and effective anti-spam filters.

Learning-based automatic categorization is a fundamental technique in the broad field of content security. This is because there are many kinds of content, in different environments, that cannot be reviewed manually, including in particular spam emails, Web pages with content that is inappropriate in certain settings (pornography, violence, etc.), and junk Web pages used to fraudulently promote other Web pages (known as Web spam). The common denominator of these contents is that their authors are determined to make them reach the end user, and to that end they are willing to fight relentlessly against any technical measure (including filtering) that prevents it. The ultimate reason is that all these contents are part of multi-million-dollar fraudulent businesses.

Text classification in the presence of adversaries is a particular case of Text Classification (the grouping of textual entities into coherent classes) and, more generally, of Knowledge Discovery in Text Databases (KDTD) [Hearst99]; this, in turn, is a particular case of Knowledge Discovery in Databases (KDD), which is very frequently identified with one of its phases, Data Mining [Fayyad96]. From the point of view of Data Mining, the detection of these types of content is an Adversarial Classification problem [Dalvi04]. Since the contents are generally textual, or contain a textual part, we can speak in general of "adversarial text classification". The characteristic that distinguishes these applications from other categorization applications is the presence of a highly motivated adversary who adapts his own techniques to the classifier in order to deceive it, or to render it useless through its errors.

It is worth noting that the term "adversarial text classification" is not generally accepted for these problems, where references to Adversarial Information Retrieval (AIR) are more frequent. That task consists solely of the detection of Web spam, which is a particular case of adversarial text classification; however, it is the first text classification task in which the existence of an adversary was explicitly acknowledged.

Adversarial text classification is the general framework of this thesis, which describes the basic elements of learning-based automatic text categorization and how it can be employed in three adversarial content security settings: email spam filtering, the filtering of inappropriate Web content, and SMS spam filtering.

In summary, Learning-Based Text Categorization, and Adversarial Text Classification in particular, is a process with the following stages (a minimal code sketch follows the list):

• DATA COLLECTION AND REPRESENTATIVE SAMPLING, to create a dataset that is adequate and representative enough to carry out the learning task.

• FEATURE ENGINEERING, which consists of representing each individual item (message, document, etc.) in a way suitable for the learning task, that is, as a vector of features. The features are generally the words, and the values are their weights in the documents. This phase also includes improving the representation through feature selection and/or extraction, for instance by assigning each feature a value according to its predictive power and keeping the highest-scoring ones, using metrics such as Information Gain.

• LEARNING, which is the application of algorithms to the data collection represented in this way, in order to generate models or classifiers, that is, autonomous systems such as rules, decision trees, etc., capable of classifying new text examples.

• EVALUATION OF THE CLASSIFIERS, in order to select, and use in practice, the most effective ones.
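By way of illustration only (the thesis itself does not include code), the following minimal Python sketch shows how these four stages can be mapped onto the scikit-learn library. The tiny corpus and every parameter choice are invented for the example, and scikit-learn's mutual information scorer stands in for Information Gain, to which it is closely related.

```python
# Minimal sketch of the four-stage pipeline described above (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# 1. Data collection: labeled examples (1 = spam, 0 = legitimate); invented toy data.
texts = ["WIN a FREE prize now", "meeting agenda attached",
         "cheap meds online", "lunch tomorrow?"] * 25
labels = [1, 0, 1, 0] * 25

pipeline = Pipeline([
    # 2. Feature engineering: bag-of-words vectors with TF-IDF weights,
    #    then feature selection with an information-theoretic score.
    ("vectorize", TfidfVectorizer(lowercase=True)),
    ("select", SelectKBest(mutual_info_classif, k=10)),
    # 3. Learning: a simple Bayesian classifier.
    ("learn", MultinomialNB()),
])

# 4. Evaluation: cross-validated accuracy, averaged over the folds.
scores = cross_val_score(pipeline, texts, labels, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```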


The objective of this Thesis has been to design and employ methods that improve the effectiveness (precision) of adversarial text classifiers in a way that is uniform across several applied tasks, such as email spam filtering, Web filtering, and SMS spam filtering. In particular, the thesis focuses on two of the most critical aspects of adversarial text classification:

• Feature engineering, consisting of the definition and optimization of the features that represent the texts for learning. Since the texts are written in natural language, this is a natural language engineering process. This process is critical because, on the one hand, an adequate representation facilitates the data mining work, and on the other, it is precisely in the representation of the texts where the adversaries' most frequent attempts to defeat the classifiers have been detected (e.g. attacks against word segmentation, image spam, etc.).

• The evaluation of the classification systems, which is fundamental in text mining, and which presents difficulties inherent to the fact that different tasks (spam filtering vs. Web content filtering) have different error costs and different asymmetries in their class distributions.

In the case of feature engineering, different techniques for representing the target texts (email messages, Web pages, SMS messages) have been designed, applied, and evaluated, taking into account the particular characteristics of each specific problem.

As regards evaluation, a generic evaluation method has been designed that was later accepted as the most rigorous method, and the de facto standard, for the evaluation of tasks such as spam filtering in international competitions.

Organization

This work aims to show that the effectiveness of Adversarial Text Categorization tasks can be improved through the use of Feature Engineering techniques adapted to the task, and that these can, moreover, be evaluated in a realistic way. For this reason, we devote the next chapter to presenting the Text Categorization task. We begin by discussing what it is and what it is used for, and what the typical phases of the process are. Secondly, we describe in more detail the case of Adversarial Text Categorization, focusing on its particularities; more specifically, on the fact that the existence of the adversary affects the learning process mainly in the text representation phase, and that evaluation is difficult because of the asymmetry of the classes and of the error costs. We then describe the most important applications of Adversarial Text Categorization, which are the filtering of unsolicited email (spam), the filtering of inappropriate Internet content, and the filtering of spam on mobile devices. Next, we present the articles that make up this thesis, indicating their scientific impact as well as the contribution of each one to the general line of this work.

We conclude by presenting a summary of the conclusions and general contributions of the Thesis.

II. TEXT CATEGORIZATION

In this section we review the relationship of text categorization to other tasks, in order to clarify concepts, and we present its applications and different categorization techniques, with emphasis on the use of Machine Learning to build classifiers from examples.

Automatic Text Categorization (or Classification) (TC) is the task of assigning documents to a set of predefined categories (also called classes or topics) [Sebastiani02]. Although text categorization systems can be built by hand (by defining a set of heuristic rules), the complexity of unstructured content written in natural language calls for the automatic construction of these systems by means of Machine Learning (ML). This approach consists of training a text classifier on a set of hand-labeled documents, and has proven to be as accurate as expert human intervention.

Text classification tasks

The goal of text classification is to give structure to an unstructured text repository, thus facilitating storage, search, and browsing [Sebastiani06]. This discipline belongs to the broad field of Text Mining (TM) [Hearst99] or, more properly, of Knowledge Discovery in Text Databases.

The first successful approach to the TC problem was Knowledge Engineering (KE), in the 1980s. A knowledge engineer had to build an expert system that could classify text automatically, but his lack of knowledge of the subject matter required the intervention of a domain expert. Moreover, the system had to be maintained manually over time, so labor costs soared. From the 1990s onwards, statistical techniques replaced the KE approach, turning TC into a problem for the statistical side of Natural Language Processing (NLP). In this approach, the classifier is built using a general inductive process trained on a set of example documents. The main advantages of NLP over KE are:

• The high degree of automation, since the engineer develops an automatic builder of classifiers;

• Reusability, because the automatic builder can be applied to create many different classifiers for many different problems and domains, simply by changing the set of training documents;

• Ease of maintenance, since changes to the system require only modifications to the training set and a new training run;

• High (present and future) availability of inductive learning algorithms;

• The accuracy of the automatic classifiers, which surpasses that of those built by human experts.

The number of text classification tasks has grown over the years, and several ways of organizing them can be found in the literature. According to Lewis [Lewis92], TC tasks can be classified along two axes: the type of learning and the granularity of the text elements. The two types of learning are defined by the control over the training set:

1. Supervised learning: the set of classes is known when the training set is built, and there are examples for each of them.

2. Unsupervised learning (clustering): the set of classes is not known before training, and the goal is to group text entities with similar content.

Three levels of granularity can be defined, taking terms, sentences, or documents as atomic elements:

1. Terms: ranging from word stems or single words to short expressions.

2. Sentences: from simple to complex sentences.

3. Documents: short spam emails, medium-sized articles, or even complete books.

The text classification process

According to [Sebastiani06], we can describe the TC process as a succession of three main phases:

1. Feature engineering, which consists of designing the representation of the documents, and usually comprises two stages. The first stage is document indexing: documents must be mapped to a representation of their content that can be directly interpreted both by a classifier-building algorithm and by the generated classifier. The most commonly used representation is a vector of the features occurring in the training set, each with a value corresponding to its weight in the document. The initial feature set is usually composed of the full set of words occurring in the training documents, often excluding the frequent words collected in a so-called stop list. In many cases these words are reduced to their (morphological) roots. Weights are assigned using statistical heuristics that reflect facts such as the number of times a term occurs in a document (term frequency) or the number of documents containing the term (document frequency, whose inverse yields the IDF component). In recent years, some research work on document indexing has started to use more complex features, either grouping single words into word n-grams, or parsing the text to obtain syntactic information, or extracting concepts for a semantic representation of the text. However, these have not been shown to improve on the standard word representation.

The second stage within feature engineering consists of feature selection and extraction, or dimensionality reduction. The sizes of the vectors obtained after the first stage are generally on the order of tens of thousands, or even hundreds of thousands, which greatly hinders the efficiency of the learning mechanisms. This second step involves reducing the length of those vectors to produce a new representation of the documents. The most common dimensionality reduction techniques are classified into feature extraction methods, such as latent semantic indexing [Li98] or term clustering [Lewis92], and feature selection techniques, such as chi-square [Yang97], information gain [Lewis92], or mutual information [Dumais98]. Feature extraction methods combine several dimensions into what will become a new feature in the reduced vector; feature selection techniques try to determine and select the best features from the original set, instead of generating new ones.

2. Learning to build the classifier. A general inductive process, trained on a set of example documents in the representation obtained after the previous phase, automatically generates a text classifier. Among the most common supervised learning mechanisms used for Text Categorization we can cite probabilistic Bayesian models, Bayesian networks, decision trees, Boolean decision methods, neural networks, classifier ensembles, and Support Vector Machines, but the number of techniques that have been studied is larger [Sebastiani02]. These algorithms differ widely in the type of model created, in effectiveness and performance, and in their ability to handle huge numbers of features. Currently, Support Vector Machines (SVM) [Joachims98] and Boosting [Schaphire00] stand out with respect to the rest, since they have outperformed their competitors in different benchmarks and challenges.

3. Evaluation. The most important aspect in classifier evaluation is effectiveness, since it is extremely important to minimize the system's errors. Nevertheless, it is sometimes advisable to consider other measures, related to performance, intelligibility of the models, portability and scalability of the techniques, etc.

Of the above phases, the ones with the greatest impact on adversarial text classification are feature engineering and evaluation: the first because it is the focus of the adversaries' attacks, and the second because it is key to measuring the effectiveness of the categorizers in hostile environments where error costs are not symmetric and, moreover, are usually unknown a priori. We describe these phases in more detail below.

Feature Engineering

Feature engineering is the process of deciding, over a given set of training instances, which properties will be taken into consideration in order to learn from them. The properties are the characteristics or features, and they can take different values; thus, each training instance is represented as a vector over a multidimensional space in which the dimensions are the features. Since many Machine Learning algorithms are very slow, or simply incapable of learning, in highly dimensional spaces, it is generally necessary to reduce the number of features used in the representation, performing either a selection (determining a suitable subset of the original features) or a feature extraction (mapping the original feature set onto a new, smaller one). These tasks are also part of the process known as feature engineering.

Tokens and weights

In Text Categorization, the most frequently used features are the character sequences, or strings, that minimally carry some kind of meaning in a text, that is, the words [Gomez02a, Sebastiani02]. In more general terms, we speak of splitting a text into tokens, a process we call tokenization. In fact, this process simply follows Salton's Vector Space Model of information retrieval [Salton81]. This model specifies that, for retrieval purposes, texts can be represented as term-weight vectors, in which the terms are processed words (our features) and the weights are numeric values representing the importance of each word in each document.

The first learning-based filters made relatively simple decisions in this respect, following what was then the state of the art in topical Text Categorization. The simplest definition of 'features' is words, a word being any sequence of alphabetic characters, with any other symbol considered a separator or whitespace. This approach was followed in the pioneering work in this field, by Sahami et al. [Sahami98]. That work was improved by Androutsopoulos et al. [Androutsopoulos00a, Androutsopoulos00b, Androutsopoulos00c], making use of a stemmer to map words onto their roots, and of a stop list (a list of frequent words that can be ignored because they contribute more noise than meaning to topical search: pronouns, prepositions, etc.). In the cited work, the features are binary, that is, their value is one if the token occurs in the message, and zero otherwise. Several other definitions of the weights or values are possible, traditionally found in the field of Information Retrieval. For example, using the same kind of tokens, the authors of [Gomez02a] and [Gomez02b] made use of TF.IDF weights, which are defined as follows:

$$w_{ij} = tf_{ij} \cdot \log_2 \left( \frac{N}{df_i} \right)$$

where $tf_{ij}$ is the number of times the i-th token appears in the j-th message, $N$ is the number of messages, and $df_i$ is the number of messages in which the i-th token appears. The TF (Term Frequency) part of the weight represents the importance of the token in the current document or message, while the second part, IDF (Inverse Document Frequency), gives a suitable idea of the importance of the token in the whole document collection. Plain TF weights are also possible.
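As a small illustration of the formula (our sketch, with an invented three-message collection):

```python
import math
from collections import Counter

# Toy collection of already tokenized messages (invented examples).
messages = [["buy", "cheap", "meds"],
            ["meeting", "at", "noon"],
            ["buy", "now", "buy"]]

N = len(messages)
# df[i]: number of messages in which token i appears.
df = Counter(tok for msg in messages for tok in set(msg))

def tfidf(msg):
    """TF.IDF weights w_ij = tf_ij * log2(N / df_i) for one message."""
    tf = Counter(msg)
    return {tok: tf[tok] * math.log2(N / df[tok]) for tok in tf}

print(tfidf(messages[2]))  # "buy" occurs twice here and in 2 of the 3 messages
```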


Even relatively simple decisions, such as lower-casing all words, can noticeably affect a filter's performance. The second generation of learning filters was strongly influenced by the work of Graham [Graham02, Graham03], who took advantage of the ever-increasing power and speed of computers to skip most of the preprocessing and simplify the decisions taken afterwards. Graham uses a more complex token definition:

1. Alphanumeric characters, dashes, apostrophes, exclamation marks, and dollar signs are part of tokens, and everything else is a separator.

2. Tokens made up exclusively of digits are ignored, along with HTML comments, which are not even considered separators.

3. Case is preserved, and neither stemming nor a stop list is applied.

4. Periods and commas are taken into account only if they occur between two digits. This way, IP addresses and prices are kept intact.

5. Price ranges, such as $20-25, yield two tokens: $20 and $25.

6. Tokens occurring within the To, From, Subject, and Reply-To lines, or within URLs, are marked accordingly. For example, "co" in the Subject line becomes "Subject*co". (The asterisk could be any character not admitted as a token constituent.)
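The following Python sketch is our rough approximation of rules 1-4 (it ignores the header-marking and price-range rules, and it is not Graham's actual code):

```python
import re

# Tokens: alphanumerics, dashes, apostrophes, exclamation and dollar signs;
# periods and commas are kept only when they join token characters,
# which (approximately) preserves IP addresses and prices.
TOKEN = re.compile(r"[\w\-'!$]+(?:[.,][\w\-'!$]+)*")

def graham_tokens(text):
    """Approximate Graham-style tokenization (rules 1-4 above)."""
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)  # drop HTML comments
    tokens = TOKEN.findall(text)
    return [t for t in tokens if not t.isdigit()]  # rule 2: skip all-digit tokens

print(graham_tokens("Visit 10.0.0.1 now!!! Price: $25, only 100 left"))
# ['Visit', '10.0.0.1', 'now!!!', 'Price', '$25', 'only', 'left']
```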


Graham obtained very favorable results in his spam filter with this token definition and an ad-hoc version of Bayesian learning. The definition has inspired more sophisticated ones, but it has also led spammers to focus on tokenization as one of the main vulnerabilities of learning-based filters. The current trend is exactly the opposite: performing almost no analysis of the text, considering any whitespace-separated string as a token, and letting the system learn from a really large number of messages (tens of thousands instead of thousands). Moreover, the HTML is not decoded, and tokens may include HTML tags, attributes, and values.

Multi-word features

Some researchers have studied features spanning two or more tokens, looking for patterns such as "get rich", "free sex", or "OEM software". Using statistical word phrases has not achieved acceptable results in Information Retrieval [Salton83], sometimes even lowering effectiveness. However, such features have had reasonable success in spam filtering. Two important works along this line are those of Zdziarski [Zdziarski04] and Yerazunis [Yerazunis03, Yerazunis04].


Token words               P(spam|w1)    P(spam|w2)    P(spam|w1*w2)
w1=FONT,  w2=face         0.457338      0.550659      0.208403
w1=color, w2=#000000      0.328253      0.579449      0.968415
w1=que,   w2=enviado      0.423327      0.404286      0.010099

Table 1. Some examples of chained tokens, after [Zdziarski04].

Zdziarski first used case-sensitive words in his DSPAM filter, later adding what he called "chained tokens". These tokens consist of sequences of two adjacent words that obey the following rules:

• There are no chains between the message header and the message body.

• Within the message header, there are no chains between individual header fields.

• Words can be combined with non-word tokens.

Chained tokens do not replace individual tokens; rather, they complement them, and both are used together for a better analysis. For example, if we are analyzing a message containing the sentence "CALL NOW, IT'S FREE!", four tokens are created by normal analysis ("CALL", "NOW", "IT'S", and "FREE"), plus three more chained tokens: "CALL NOW", "NOW IT'S", and "IT'S FREE!". Chained tokens are what the fields of Language Modeling and Information Retrieval traditionally call word bigrams.

Table 1 shows how chained tokens can lead to better (more accurate) statistics. The table shows the probability of spam given the occurrence of a word, a conditional probability that is normally estimated with the following formula:

$$P(spam \mid w) \approx \frac{N(spam, w)}{N(w)}$$

where $N(spam, w)$ is the number of times w appears in spam messages and $N(w)$ is the number of times the word w appears overall. The count can also be done per message: the number of spam messages in which w appears, and the number of messages in which w appears. In the table, we can see that the words "FONT" and "face" have probabilities close to 0.5, which means that they support neither the spam nor the legitimate class. However, the probability of the bigram is around 0.2, which represents a fairly strong support for the legitimate class. This is due to the fact that spammers and legitimate users employ different HTML coding patterns. While the former use ad-hoc code, the latter generate HTML messages with popular email clients such as Microsoft Outlook or Mozilla Thunderbird, which always use the same (more legitimate) patterns, such as placing the face attribute of the font near the HTML 'FONT' tag. We can also see how, even though "color" and "#000000" are fairly neutral, the pattern "color=#000000" (the "=" symbol is a separator) is not innocent at all. Zdziarski's experiments show noticeable drops in error rates, especially in false positives, when chained tokens are used in addition to normal tokens.
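The following sketch is our illustration, not code from the cited filters: it generates Zdziarski-style chained tokens and estimates the per-message spam probability above, and it also builds the SBPH-style sparse pairs of Yerazunis, described in the next paragraph.

```python
def chained_tokens(tokens):
    """Zdziarski-style chains: adjacent token pairs, kept alongside the originals."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def sbph_pairs(tokens, window=5):
    """SBPH-style sparse pairs: each token paired with every later token
    inside a sliding window (our rendering of the idea)."""
    pairs = []
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window]:
            pairs.append(f"{first} {second}")
    return pairs

def spam_probability(token, spam_msgs, ham_msgs):
    """P(spam | w) ~ N(spam, w) / N(w), counting per message as described above."""
    n_spam = sum(token in m for m in spam_msgs)
    n_total = n_spam + sum(token in m for m in ham_msgs)
    return n_spam / n_total if n_total else 0.5  # neutral if never seen

tokens = "You can get porn free".split()
print(chained_tokens(tokens))  # ['You can', 'can get', 'get porn', 'porn free']
print(sbph_pairs(tokens))      # includes 'You free', 'can free', 'get free', 'porn free'
```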

Yerazunis takes a slightly more sophisticated approach in [Yerazunis03] and [Yerazunis04]. Since spammers had already started to fool learning-based filters by disguising spammy expressions with intervening symbols, he proposed enriching the feature space with bigrams obtained by combining tokens within a sliding window of 5 words over the training texts. He called this SBPH (Sparse Binary Polynomial Hashing) and implemented it in his CRM114 Discriminator filter. Within a window, all word pairs are built, ordered, with the second word at the end. For example, given the sentence/window "You can get porn free", the following four bigrams are generated: "You free", "can free", "get free", "porn free" (see the sketch above). With these features, and with what the author calls the Bayesian Chain Rule (a simple application of Bayes' theorem), he obtained impressive results on his personal email, claiming to have reached a stable 99.9% accuracy.

Feature Selection and Extraction

Dimensionality reduction is a required step, since it increases efficiency and reduces overfitting. Many algorithms perform very poorly when working with large numbers of features (with the exception of kNN or SVM), so a process is needed to reduce the number of elements used to represent documents. There are mainly two ways to achieve this goal: feature selection and feature extraction.


Feature selection tries to obtain a subset of elements with predictive power equal to or greater than that of the original term set. To select the best terms we must use a function that scores and ranks the terms according to their degree of suitability; this function measures the quality of the features. In the literature, terms are generally selected according to their Information Gain (IG) score [Sahami98], [Androutsopoulos00a], [Sakkis01], and sometimes according to ad-hoc metrics [Gomez00], [Pantel98]. Information Gain is defined as:

$$IG(X, C) = \sum_{x \in \{0,1\};\; c \in \{u,l\}} P(X=x, C=c) \cdot \log_2 \frac{P(X=x, C=c)}{P(X=x) \cdot P(C=c)}$$

where x ranges over the absence/presence of the term and c over the two classes.

Interestingly, IG turns out to be one of the best selection metrics [Yang97]. Other quality metrics are Mutual Information [Dumais98], [Larkey96], chi-square ($\chi^2$) [Caropreso01], [Lewis92], [Yang97], Document Frequency [Yang97], [Sebastiani02], and Relevance Score [Wiener95]. Of all these measures, $\chi^2$ appears as another of the most effective in the literature. It is defined as:

$$\chi^2(t_i, c_k) = \frac{D \cdot \left[ P(t_i, c_k) \cdot P(\bar{t}_i, \bar{c}_k) - P(\bar{t}_i, c_k) \cdot P(t_i, \bar{c}_k) \right]^2}{P(t_i) \cdot P(\bar{t}_i) \cdot P(c_k) \cdot P(\bar{c}_k)}$$

where D is the number of documents and the bars denote the absence of the term or of the class.
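As an illustration of the two formulas, under our own simplifying assumptions (binary term occurrence, two classes), the following function computes both scores from the four cells of a term/class contingency table:

```python
import math

def term_scores(n11, n10, n01, n00):
    """Information Gain and chi-square for one term from a 2x2 table.
    n11: docs with term, in class; n10: with term, not in class;
    n01: without term, in class; n00: without term, not in class."""
    n = n11 + n10 + n01 + n00
    ig = 0.0
    # Sum P(x, c) * log2( P(x, c) / (P(x) * P(c)) ) over the four cells.
    for n_xc, n_x, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_xc:
            ig += (n_xc / n) * math.log2(n_xc * n / (n_x * n_c))
    # Chi-square: the count form of the formula above.
    chi2 = n * (n11 * n00 - n01 * n10) ** 2 / (
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return ig, chi2

# A term appearing in 40 of 50 spam messages and 5 of 50 legitimate ones:
print(term_scores(40, 5, 10, 45))
```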


Feature extraction is a technique whose objective is to generate an artificial set of terms that is different from, and smaller than, the original one. The techniques used for feature extraction in Automatic Text Categorization are Term Clustering and Latent Semantic Indexing. Term clustering creates groups of semantically related terms; this technique is very common when tokenization of email messages is used for their representation [Graham03]. Latent Semantic Indexing [Deerwester90] tries to solve the problem caused by polysemy and synonymy when indexing documents: by creating a space of lower dimensionality, combining the original vectors according to patterns of terms that occur together, it compresses the index vectors.
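A minimal illustration of this idea using truncated SVD, which underlies LSI, via scikit-learn (our sketch; the four documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cheap meds online", "cheap pills online",
        "project meeting notes", "meeting agenda notes"]
terms = CountVectorizer().fit_transform(docs)             # term-weight vectors
lsi = TruncatedSVD(n_components=2).fit_transform(terms)   # compressed index vectors
print(lsi.shape)  # (4, 2): each document now lives in a 2-dimensional space
```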

Evaluation

The evaluation of a categorization system is a critical aspect, insofar as without it no decisions can be made about its quality or its deployment. The evaluation of classification systems in general is focused on effectiveness, since these are systems that make decisions that can be wrong (unlike, for example, Database Management Systems: if they make a wrong decision on a query, they are not ineffective; they simply do not work). Although evaluation is generally focused on effectiveness, efficiency is sometimes considered, and less frequently the comprehensibility of the models, the portability and scalability of the techniques, etc. In general, the characteristics to take into account are the same as for any Natural Language Processing system, partially standardized in Europe by the EAGLES group [Eagles95].

In general, it is essential that the most relevant elements used in the evaluation (procedures, metrics, and collections) be standard, in order to facilitate the comparison of results with previous work or with other authors. An added difficulty in the case of adversarial classification is that the problem documents (spam, pornographic websites, etc.) evolve quickly to adapt to the most effective classifiers, so the validity of the collections, and of the classifiers themselves, is limited in time.

Evaluation procedure

In general, one starts from a collection of manually categorized documents, on which the classifier is trained and evaluated. Obviously, it is not sensible to train and evaluate on the same documents, so the collection is split into a training subcollection and an evaluation subcollection [Sebastiani02]. This procedure has traditionally been applied with the Reuters-21578 collection [Lewis97], the most widely used in topical categorization. However, several different partitions, with different degrees of difficulty for a classifier, have become established in the literature, which has not exactly facilitated comparison [Yang99].

The above procedure deprives of data both the training (one can assume that the larger the training collection, the better the classifier) and the evaluation (the larger the evaluation collection, the more reliable the measurement). Currently, the most widely followed approach (especially in recent adversarial classification work, and in the associated competitive evaluations) is cross-validation. This technique consists of dividing the whole collection into K groups of equal size (frequently K=10), in such a way that the class proportions are kept in each group. Then K runs are launched, each using one group for evaluation and the rest for training, taking a measurement that is finally averaged over all the runs.
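A minimal sketch of this protocol (ours, not tied to any particular system) using scikit-learn's stratified splitter:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(classifier, X, y, k=10):
    """K runs: each group is used once for evaluation, the rest for training;
    StratifiedKFold keeps the class proportions in every group.
    X and y are assumed to be numpy arrays (features and class labels)."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
        classifier.fit(X[train_idx], y[train_idx])
        scores.append(classifier.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))  # the measure averaged over the K runs
```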

Effectiveness metrics

The effectiveness metrics used in the evaluation of text categorization systems come, naturally, from the fields of Information Retrieval and Machine Learning.

Usually, these metrics are based on counting the number of correct classification decisions relative to the total number of decisions to be made over the evaluation collection. Using the evaluation collection, the resulting classifier is applied to its examples and a confusion (or contingency) matrix is obtained. From that table, the following numbers are computed:

• TP (True Positives). Number of examples of the class classified correctly.

• FN (False Negatives). Number of examples of the class classified incorrectly.

• FP (False Positives). Number of examples that do not belong to the class but have been, incorrectly, classified into it.

• TN (True Negatives). Number of examples that do not belong to the class and have been correctly classified as not belonging to it.

From these values, the most common measures are the following [Salton83], [Sebastiani02]:

Recall (R) represents the proportion of documents correctly classified into C+ over those that should have been classified into it. Precision (P) represents the proportion of documents correctly classified into C+ over all those that have been classified into it. Accuracy (A) represents the proportion of correct decisions over the total number of decisions. Error (E) represents the proportion of wrong decisions over the total number of decisions. These measures are computed with the following formulas, where N is the total number of decisions (N = TP + FN + FP + TN):

$$R = \frac{TP}{TP + FN} \qquad P = \frac{TP}{TP + FP} \qquad A = \frac{TP + TN}{N} \qquad E = \frac{FP + FN}{N}$$
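An illustrative computation of these measures from the four counts (our sketch, with invented numbers; it also anticipates the F1 measure discussed next):

```python
def effectiveness(tp, fn, fp, tn):
    """Recall, precision, accuracy, and error from the contingency table counts."""
    n = tp + fn + fp + tn          # total number of decisions
    r = tp / (tp + fn)             # recall
    p = tp / (tp + fp)             # precision
    return {"R": r, "P": p, "A": (tp + tn) / n, "E": (fp + fn) / n,
            "F1": 2 * p * r / (p + r)}  # harmonic mean of P and R (see below)

print(effectiveness(tp=90, fn=10, fp=5, tn=895))  # invented counts
```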

The accuracy and error measures are two sides of the same coin (A = 1 - E), and are frequently discarded in categorization, because for sparsely populated categories (in other words, for rare events) it is very easy to obtain a seemingly very effective classifier in a trivial way: a classifier that always predicts the majority class already achieves high accuracy.


The recall and precision measures are complementary, and one can generally be increased at the expense of the other, but it is difficult to increase both simultaneously. They can be collapsed into a single measure using Van Rijsbergen's F1 [Salton83], which combines them as their harmonic mean: F1 = 2PR / (P + R).

III. ADVERSARIAL TEXT CLASSIFICATION

Adversarial text classification can be generically defined as a text classification task in which there is an adversary whose objective is to render the produced classifier, whether built manually or automatically through learning, useless. The fundamental element of adversarial text classification is precisely the presence of this adversary. The two most obvious ways to render a classifier useless are to make it commit a substantial number of errors, which can be errors of omission (for example, in the case of a spam filter, failing to detect a significant number of spam messages, that is, incurring many false negatives) or errors of commission (for example, classifying many legitimate messages as spam, that is, incurring many false positives):

1. If the classifier incurs many false negatives (say, for example, 50%), the effect is that the user is forced to detect manually a significant number of documents (spam messages, pornographic Web pages, spam Web pages). The cost of this work is variable and depends on the specific application, since it is a relatively simple task in the case of email spam, but much more complex in the case of inappropriate Web content and Web spam. While, in the case of email spam, any help can be welcome (even if it only removes half of the spam, the system can be useful, albeit improvable), in the other cases it is not feasible to traverse the millions of Web pages of each inappropriate domain (pornography, casino gambling, racism and cults, etc.) or of Web spam in order to prevent the protected user from accessing them, or to prevent them from being returned in search engine queries. Moreover, it is quite likely that in some cases the end user himself will not want to report the error.

2. If the classifier incurs many false positives, the effect is that the user is forced to correct the classification of these errors, which can sometimes entail a much higher cost. For example, in the case of email spam, legitimate messages classified as spam may be deleted directly, or placed in a special "quarantine" folder that the user must visit periodically, and there is always the risk of losing important messages (family events, work messages from clients or managers, etc.). In the case of inappropriate content filtering, the error is less serious because the user becomes aware of it quickly (since he cannot access the content), and can request reclassification (which in turn can be costly).

The problem of asymmetric and unknown costs

Looking at the different types of errors (false negatives, FN, and false positives, FP) that a text categorizer can commit, we can draw a number of conclusions:

1. The costs differ depending on the task. While, for example, the potential cost of a false positive in spam filtering can be very high, because a legitimate email can go unnoticed by the user, the cost of the same type of error in Web content filtering is lower, because the end user sees the page blocked and notices the error more quickly (even though reclassifying the blocked page may take more effort).

2. In this kind of adversarial text classification task, the costs are asymmetric, that is, the cost of an error of omission differs from the cost of an error of commission, for several reasons. On the one hand, the cost depends on the risk of the error going unnoticed by a user or a supervisor. In particular, there is a non-negligible probability that certain errors will not be reported by the user to the supervisor, as in the case of inappropriate content detection. On the other hand, the cost of correcting an error depends on the task and on how the operating infrastructure is organized. For example, it is easier to correct an error on a spam message (it is deleted or sent to quarantine with one click) than in inappropriate content filtering (the user must report the error, and the supervisor must review the content and reclassify it). Finally, in certain contexts an error can be more serious than in others. For example, it is more harmful for a child to access pornographic content than for an adult to do so, and the extreme case arises when the content is being restricted for medical reasons, as in the case of casino gambling addicts.

It is worth noting that the costs are normally unknown and depend on the class distribution. For example, spam messages currently constitute a high percentage of the email circulating on the network (above 80%), so an error is potentially more likely than in the case of access to pornographic content (which constitutes less than 2% of the content on the Web, although not on P2P systems). In this sense, it may be important to take into account the cumulative costs of all the errors of one type. Furthermore, the class distribution (e.g. spam vs. legitimate mail) may vary depending on the service provider (for example, there will be more spam at a Web mail provider such as Microsoft Live Mail than at a corporate provider), the place where the service is provided (there may be more spam in one country or region than in another), etc.

The existence of asymmetric and variable costs, together with the changing distribution of the classes, is an element with a very important influence on the method used to evaluate the classifiers, which must be able to take these factors into account.


Some authors such as [ ] have evaluated under asymmetric cost conditions, but entirely fictitious ones: specifically, the evaluation was carried out using cost ratios of 1, 9, 99 and 999, but there is no evidence justifying those figures, since the real costs are unknown at the time the learning process is performed.
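How sensitive such an evaluation is to the chosen ratio is easy to illustrate. The following Python sketch (with made-up confusion counts, not results from any experiment in this thesis) computes the total cost of two imaginary filters under the ratios cited above; note how the apparent winner flips as the ratio grows:

```python
# Total cost of a filter given false positives (FP), false negatives (FN)
# and a cost ratio lambda_ = cost(FP) / cost(FN).
def total_cost(fp, fn, lambda_):
    return lambda_ * fp + fn

# Two hypothetical filters on the same (imaginary) test collection:
# filter A is aggressive (few FN, more FP), filter B is conservative.
filters = {"A (aggressive)": (30, 10), "B (conservative)": (2, 120)}

for ratio in (1, 9, 99, 999):
    for name, (fp, fn) in filters.items():
        print(f"ratio 1:{ratio}  {name}: cost = {total_cost(fp, fn, ratio)}")
    best = min(filters, key=lambda name: total_cost(*filters[name], ratio))
    print(f"  -> best at 1:{ratio}: {best}\n")
```

This is precisely why an evaluation method that does not commit to a single cost ratio, such as the ROCCH method used throughout this thesis, is preferable.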

Attacks on the Classifier

The presence of the adversary means that, as a consequence of its attacks on the classifier, the classifier's effectiveness can degrade quickly. In response, the producer of the classifier (e.g., the filter vendor) deploys countermeasures that limit the effectiveness of the adversary's attacks, so that the classifier recovers its effectiveness. However, this cycle goes on indefinitely, because the adversary refines its attacks again, and so forth. In other words, a kind of "arms race" starts between the classifier producer and its adversary, in which both are economically motivated not to abandon it: the classifier producer is a manufacturer of security tools who needs them to be effective in order to sell them, and the adversary obtains handsome profits from its illicit activities.

When we speak of direct attacks we mean attacks against the core of statistical classifiers that try to make them err. The effectiveness of these attacks depends largely on the type of classifier, its configuration, and its previous training (messages received and marked as spam by the user). One of the simplest attacks consists of including random words, with the intention that those words be recognized as "good words" by the classifier. This attack is very simple and has been considered ineffective [cumming04], but it gives us a general idea of how attacks on the learning process work. We can identify four main categories of direct attacks against the learning process (a sketch of a simple countermeasure follows the list):

• Tokenization: These attacks target the tokenization step the classifier relies on to extract the main features of the texts. Examples of tokenization attacks include splitting words with whitespace, hyphens or asterisks, or using HTML, JavaScript or CSS traps.



• Obfuscation: In these attacks, the message contents are hidden using different kinds of encoding, including HTML entities or URL encoding, character substitution, printable Base64 encoding, and others.



• Statistical: These methods try to skew the message statistics, adding more good samples or using fewer bad ones. There are variations of this kind of attack depending on how the words are selected. Weak statistical attacks, or passive attacks, use random words; strong statistical attacks, or active attacks, carefully select the words needed to mislead the classifier using some kind of feedback. Strong statistical attacks are simply a refined version of weak statistical attacks.




• Hiding the text: Some attacks avoid words altogether and embed the spam message as an image, Flash, RTF or some other file format; some attacks insert a link to a web page hosting the actual spam message. The goal is for the user to see the message while the classifier cannot extract any relevant feature from it.
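As an illustration of the kind of countermeasure this arms race forces on the filter side, the following Python sketch (our own illustrative example, not a technique evaluated in this thesis) normalizes some common tokenization and obfuscation tricks before feature extraction:

```python
import re
import unicodedata

# Common single-character substitutions used by spammers (illustrative subset).
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Undo simple tokenization/obfuscation attacks before tokenizing."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode()        # drop combining accents
    text = text.translate(LEET).lower()                   # map leet-speak characters
    text = re.sub(r"(?<=\w)[\s\-\*\._]+(?=\w)", "", text) # join v-i a_g*r.a style splits
    return text

print(normalize("V-1.a_g r@"))  # -> "viagra"
```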

In conclusion, from the previous sections we can infer that in Adversarial Text Categorization there is a situation of cost and class asymmetry that demands an appropriate evaluation method, and that feature engineering is the most important part of the process, because it is where the greatest weakness appears, as the attacks discussed above have demonstrated.


IV. APPLICATIONS

Application to Junk Email Filtering

In recent years, email spam has become a steadily growing problem with a large economic impact on society. In this chapter we present the problem of spam, how it affects us and how we can fight it. In particular, we discuss the structure and the Machine Learning process used for this task, as well as how to make them cost-sensitive by means of different methods. We also cover how to evaluate spam filters using basic metrics, the TREC metrics, and the Receiver Operating Characteristic Convex Hull method, which best suits the classification of those problems in which the target conditions are not known, as is our case.

What is Spam?

In the literature we find several terms referring to unsolicited email: junk mail, bulk mail or unsolicited commercial email (UCE) are some of them, but the most common word of reference is "SPAM". The origin of the word spam is not clear, but quite a few authors agree that the term was taken from a Monty Python sketch in which a couple walks into a restaurant and the wife tries to be served something other than Spam (canned pork meat). In the background, a group of Vikings sings the praises of Spam: "SPAM, SPAM, SPAM, SPAM... lovely SPAM, wonderful SPAM". Soon, the only word that can be heard on stage is "Spam". The same idea can be applied to the Internet if a disproportionate number of spurious mailings is tolerated: real messages would end up being indistinguishable from the mass of spam.

But, what is the difference between legitimate email and spam? We can consider an email message to be spam if it meets the following characteristics:

• Unsolicited. The recipient is not interested in receiving the information.

• Unknown sender. The recipient does not know, and has no relationship with, the sender.

• Massive. The message has been sent to a large number of addresses.

The Problem of Email Spam

The email spam problem can be quantified in economic terms. Workers waste a good number of hours every day: not only the time spent reading spam, but also the time devoted to deleting those messages. Consider a corporate network of 500 computers, each of them receiving some 10 spam messages per day. If each machine wastes 10 minutes a day because of those messages, that alone amounts to roughly 83 working hours lost daily. The United Nations Conference on Trade and Development estimates that the worldwide economic impact of spam could reach 20 billion dollars in lost time and productivity.


The California State Legislature concluded that spam cost US companies alone more than 10 billion dollars in 2004, including lost productivity and the additional equipment, software and manpower needed to fight the problem. Whether a worker receives dozens of messages a day or just a few, reading and deleting them takes time and lowers labor productivity. A report by Nucleus Research claims that spam will cost US employers 2,000 dollars per employee in lost productivity. Nucleus concluded that unsolicited email reduced worker productivity by a staggering 1.4%. Spam filtering solutions did little to control this situation, reducing spam levels by only 26% on average.

There are also problems related to the technical obstacles caused by spam. Spam can sometimes be dangerous, carrying viruses, trojans or other kinds of malware, and opening security breaches in computers and networks. Network and email administrators have to invest time and effort in deploying systems to fight spam.

Processing Structure of Spam Filtering

The processing structure of the junk email filtering task follows the structure described for Adversarial Text Categorization. Basically, we find the following phases:

1. Feature Engineering, in which email messages are represented as attribute vectors and a dimensionality reduction process is then carried out, aimed at generating an artificial set of terms that is different from, and smaller than, the original one. The feature extraction techniques used in junk email filtering are term grouping, the phrase-based approaches we have already mentioned, and Latent Semantic Indexing. Term grouping builds groups of semantically related terms; this technique is very common when tokenization of email messages is used for their representation [Graham03]. Latent Semantic Indexing [Deerwester90] tries to solve the problem caused by polysemy and synonymy when indexing documents.

2. Learning. One of the most important families of techniques in document categorization is that of learning algorithms, whose goal is to generate a function approximating the one that assigns each document to a category, in this case the categories "spam" or "legitimate". In the articles that make up this thesis we describe in detail the different algorithms used to filter junk email.

3. Evaluation. The evaluation of the classifier is a critical point: no progress can be expected if there is no way to measure it.

Initially, the Machine Learning community used traditional evaluation techniques without taking into account one of the main problems of spam filtering: the costs of misclassification. The fact that a false positive (a legitimate message classified as spam) is far more harmful than a false negative (spam classified as legitimate) implies that the evaluation measures must give more weight to the worse errors. For this reason, spam filters cannot be evaluated assuming symmetric error costs. The evaluation must take into account the greater weight of the worse errors, but... what weight? [Androutsopoulos00a] computed it for a variety of classifiers in three scenarios corresponding to three cost values (1:1, 1:9 and 1:999). The main problem of that work is that the costs used do not correspond to real-world conditions, which are not known and can be highly variable. There is no proof that a false positive is either 9 or 999 times worse than the opposite error. In this thesis we have designed more adequate methods that have gone on to be used in practice and that are described in more detail in the articles that make it up; in particular, we have proposed the use of the ROCCH method for the evaluation of Adversarial Text Categorization tasks such as junk email filtering.
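To make the ROCCH idea concrete, here is a minimal Python sketch (an illustration under simplified assumptions, with hypothetical operating points; the articles describe the method as actually used) that keeps only the classifiers on the upper convex hull of ROC space:

```python
# Minimal ROCCH sketch: each classifier is a point (fpr, tpr) in ROC space.
# Classifiers below the convex hull are suboptimal for every cost/class ratio.
def roc_convex_hull(points):
    # Add the trivial classifiers (reject all / accept all) and sort by FPR.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:  # monotone-chain upper hull
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop hull[-1] if it lies on or below the segment hull[-2] -> p.
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Hypothetical (FPR, TPR) results for four filters:
filters = [(0.02, 0.55), (0.10, 0.80), (0.25, 0.85), (0.40, 0.97)]
print(roc_convex_hull(filters))
# The point (0.25, 0.85) is dropped: that filter can never be the best choice,
# whatever the (unknown) cost ratio and spam/legitimate proportion.
```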


Application to Web Content Filtering

The increasing availability of inappropriate, dangerous and illegal content on the Web has motivated the emergence of Internet filters and monitoring systems as tools for protection and policy enforcement.

The Web Filtering Problem

Some of the reasons why Internet filtering and monitoring measures are needed include:

• Controlling Internet abuse in the workplace. Internet services are essential in today's companies but, since the Internet is full of entertainment content (from pornography to the gambling industry, through news or travel), some employees occasionally use it to waste time and resources on tasks unrelated to their work. Workers' access to inappropriate Web content can have a considerable impact on the company [Wynn01].

• Child protection. The Internet and the Web are extremely useful tools for children and young people, both as a source of information and entertainment and as a means of communication with their friends and family. However, children are an extremely sensitive group of Internet users, since they are still in their formative years, and for that reason they must be specially protected. Children face quite a few dangers on the Internet, such as [Kam07]: exposure to inappropriate and even illegal content; contact and harassment by online sexual predators; cyber-bullying; and illegal activities, from being victims of online fraud to taking part in hacking or sharing illegal content.

Thus, web filters have evolved toward the inclusion of more sophisticated and effective techniques, covering not only incoming information (e.g., Web content) but also outgoing information (industrial secrets, credit card numbers, etc.), and analyzing a wider range of content types and protocols (instant messaging, P2P, specific online games, etc.). Thanks to this evolution, filters have driven important developments and innovations in other research fields such as image processing.

Processing Structure of Web Filtering

The three phases of the Text Categorization process described above have to be adapted to the peculiarities of this problem as follows:

1. Feature Engineering. The first step is document indexing. Web pages have a rich HTML format that most research works do not exploit properly. Nevertheless, some approaches make explicit use of it, such as the system described in [Agarwal06], which considers seven specific sections in web documents: URL, hyperlink tags, image tags, title, metadata, body and tables. Words are separated according to their section, thus generating vectors of vectors to represent web pages.


The word vector is the most widely used representation, applying stop lists and stemming when the language allows it (usually, Western languages). The next task is dimensionality reduction. In [Gomez03], information gain is used to filter out those terms that do not seem relevant for pornography detection (a sketch of this kind of selection is given right after these phases), and [Chou08] compares three quality measures for three different workplace filtering domains (news, shopping and sports).

2. Classifier learning. A significant number of Machine Learning methods have been tried for content filtering. Among the most commonly used are the following: several versions of Naïve Bayes ([Chou08], [Polpinij06] and [Gomez03]); decision trees ([Guermazi07], [Chou08] and [Gomez03]); lazy learning mechanisms such as k Nearest Neighbors ([Chou08] and [Su04]); neural networks ([Chou08] and [Lee02]); and Support Vector Machines ([Agarwal06], [Chou08], [Greevy04], [Kim06], [Polpinij06] and [Gomez03]).

3. Evaluation. The works presented in this field are completely heterogeneous, which makes it impossible to compare evaluation results. Every study uses private collections of web pages on very different subjects (pornography, racism, violence, workplace misuse, etc.) and in different languages (English, Spanish, Chinese, Italian, etc.). Collection sizes vary, ranging from a few hundred to several thousand pages. Moreover, the

evaluation metrics are diverse: precision and recall (from information retrieval), the F measure, accuracy and error, or the ROCCH method, which is the method we propose in this thesis as the best suited to this kind of task and which is described later in the articles.
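As a concrete illustration of the information gain selection mentioned in the feature engineering phase above, the following Python sketch ranks the terms of a toy, made-up collection by their information gain with respect to the class (a simplified version of the selection performed in [Gomez03]):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) split, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * log2(p)
    return h

def information_gain(docs, term):
    """IG of observing `term` w.r.t. the binary label of each (tokens, label) doc."""
    with_t = [(toks, y) for toks, y in docs if term in toks]
    without_t = [(toks, y) for toks, y in docs if term not in toks]
    def split(subset):
        pos = sum(1 for _, y in subset if y == 1)
        return pos, len(subset) - pos
    h_class = entropy(*split(docs))
    h_cond = sum(len(s) / len(docs) * entropy(*split(s)) for s in (with_t, without_t) if s)
    return h_class - h_cond

# Toy collection: (set of tokens, label) with label 1 = inappropriate.
docs = [({"casino", "poker"}, 1), ({"casino", "bonus"}, 1),
        ({"news", "travel"}, 0), ({"travel", "poker"}, 0)]
for term in ("casino", "poker", "travel"):
    print(term, round(information_gain(docs, term), 3))
```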

Application to SMS Spam Filtering

The Short Message Service (SMS) is a text communication service used in mobile communications. The text that can be sent in each message is limited to 160 characters, and these messages are commonly used among mobile phone users as a substitute for voice calls in situations where voice communication is impossible or undesirable. This form of communication is also very popular because in some places text messages are much cheaper than making a phone call. SMS messages have long been a business for telecommunications operators, representing their largest source of revenue worldwide after voice communications.

The SMS Spam Problem

The growth in the number of mobile device users, together with the falling price of sending messages due to competition and to the emergence of other communication systems such as WhatsApp or Line, has led to a dramatic increase in SMS spam. Recent reports indicate that the volume of mobile spam is growing considerably year after year. In practice, fighting this plague is difficult for several reasons, among them the lower SMS volume compared with the volumes handled in email spam, which has led many users and service providers to ignore the problem, and has also resulted in a limited availability of mobile spam filtering software. Moreover, because messages are under 160 characters long and full of slang used to save letters, the feature engineering phase of the SMS spam filtering task becomes even more critical. On top of all this, given the growing popularity and zero cost of instant messaging services such as WhatsApp or Line, we may well see spam move to those services in the coming years, where these techniques can still be applied.

Processing Structure of SMS Spam Filtering

In this task, owing to the short length of the message texts and the scarcity of recognized evaluation collections, the feature engineering and evaluation processes become critical, especially the former. The steps carried out, which are described in detail in the article, are the following (a sketch of the feature extraction is given after these steps):

1. Feature Engineering. Tokenization is probably the most important step in the process of analyzing messages for classifier learning. A poor representation of the problem data can lead to a classifier with poor quality and accuracy. We have used the following set of attributes to represent SMS messages (spam or legitimate):




• Words: sequences of alphanumeric characters in the message text. We consider any non-alphanumeric character to be a separator. Words are the basic components of any message.



• Lowercased words: the words of the message text in lower case, according to the previous definition of word. In this way we map different strings to the same attributes, obtaining new token frequencies that affect the learning process.



• Character bi-grams and tri-grams: sequences of 2 or 3 characters included in any lowercased word. These attributes try to capture morphological variance in a language-independent way.



• Word bi-grams: sequences of 2 words within a window of the 5 words preceding the current word. It has been shown that the most relevant dependencies between words in a language occur only within a window of size 5; that is, a word rarely influences another word more than 5 words away.

2. Classifier learning. For this task several learning algorithms have been tried, among them Naïve Bayes, C4.5, PART and Support Vector Machines. The results obtained with each of them can be seen later, in the article devoted to this task.


3. Evaluation. As in the other two applications presented in this thesis, evaluation is a key step to verify the validity of the work. We have to carry out an evaluation oriented to a set of variable conditions, in terms of class distribution (and balance) and the cost of false positives (and negatives), which are not known in advance. The most suitable evaluation method for an environment with these characteristics is the Receiver Operating Characteristic Convex Hull. We have already used this method in the previous experiments on junk email filtering and Web content filtering. It has become a standard in spam email filtering and is described in the articles of the thesis.
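The following Python sketch is our own illustrative rendering of the attribute set listed above (the feature-name prefixes are invented for readability; the experiments used their own tooling):

```python
import re

def sms_features(text: str):
    """Extract words, lowercased words, char 2/3-grams and word bi-grams."""
    words = re.findall(r"[0-9A-Za-z]+", text)   # non-alphanumerics are separators
    lower = [w.lower() for w in words]
    feats = [f"w:{w}" for w in words] + [f"lw:{w}" for w in lower]
    for w in lower:                              # char n-grams of lowercased words
        for n in (2, 3):
            feats += [f"c{n}:{w[i:i+n]}" for i in range(len(w) - n + 1)]
    for i, w in enumerate(lower):                # word pairs within a 5-word window
        for prev in lower[max(0, i - 5):i]:
            feats.append(f"bg:{prev}_{w}")
    return feats

print(sms_features("WIN a FREE prize now!!"))
```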


V. ARTICLES

We now present the 4 reference articles, selected for their impact and their contribution to the general line of this thesis. The order of presentation is not chronological; rather, they are shown according to the logical structure used throughout this work.


Article 1: Email Spam Filtering

Puertas Sanz, E., Gómez Hidalgo, J. M., Cortizo Pérez, J. C. (2008). Email Spam Filtering. Advances in Computers. Elsevier Academic Press, (Vol. 74, pp. 45-114).

Impact

This journal article was published by Elsevier. It is indexed in leading scientific indexes such as the Journal Citation Reports (JCR), in the fourth quartile of the category "Computer Science, Software Engineering", and the Scimago Journal Rank (SJR), in the second quartile of the category "Computer Science".

Abstract

In this article we present the problem of spam, how it affects us and how we can fight it; we analyze the legal, economic and technical measures used to stop these unsolicited emails and, among all the technical measures, we focus on those based on content analysis, since these have proved particularly effective in spam filtering, and we explain in detail how they work. Specifically, we explain the structure and process of the different Machine Learning methods used for this task, as well as how to make them cost-sensitive by means of different methods, such as those based on threshold optimization, instance weighting or MetaCost. We also cover how to evaluate spam filters using basic metrics, the TREC metrics and the Receiver Operating Characteristic Convex Hull method, which best suits the classification of those problems in which the target conditions are not known, as is our case, and we analyze how real filters are used in practice. We also present the different methods used by spammers to attack spam filters and anticipate what we can expect to find in the coming years in the battle between spam filters and spammers.

Contributions

In this article we review the state of the art in junk email filters and show that, by using feature engineering for the indexing and representation of messages, we can improve effectiveness in junk email filtering tasks. To this end we have taken into account the asymmetry of error costs, and we have proposed evaluation measures that fit this peculiarity. More specifically, we propose the use of the Receiver Operating Characteristic Convex Hull (ROCCH) method, which best suits the classification of those problems in which the target conditions are not known, as is our case, and which has become a standard for the evaluation of spam filters in the reputed TREC (Text REtrieval Conference) competitions.


Email Spam Filtering

ENRIQUE PUERTAS SANZ
Universidad Europea de Madrid
Villaviciosa de Odón, 28670 Madrid, Spain

JOSÉ MARÍA GÓMEZ HIDALGO
Optenet
Las Rozas, 28230 Madrid, Spain

JOSÉ CARLOS CORTIZO PÉREZ
AINet Solutions
Fuenlabrada, 28943 Madrid, Spain

Abstract

In recent years, email spam has become an increasingly important problem, with a big economic impact on society. In this work, we present the problem of spam, how it affects us, and how we can fight against it. We discuss legal, economic, and technical measures used to stop these unsolicited emails. Among all the technical measures, those based on content analysis have been particularly effective in filtering spam, so we focus on them, explaining how they work in detail. In summary, we explain the structure and the process of different Machine Learning methods used for this task, and how we can make them cost-sensitive through several methods like threshold optimization, instance weighting, or MetaCost. We also discuss how to evaluate spam filters using basic metrics, TREC metrics, and the receiver operating characteristic convex hull method, which best suits classification problems in which target conditions are not known, as is the case here. We also describe how actual filters are used in practice, present different methods used by spammers to attack spam filters, and outline what we can expect to find in the coming years in the battle of spam filters against spammers.

ADVANCES IN COMPUTERS, VOL. 74
ISSN: 0065-2458/DOI: 10.1016/S0065-2458(08)00603-7
Copyright © 2008 Elsevier Inc. All rights reserved.

1. Introduction ... 47
   1.1. What is Spam? ... 47
   1.2. The Problem of Email Spam ... 47
   1.3. Spam Families ... 48
   1.4. Legal Measures Against Spam ... 50
2. Technical Measures ... 51
   2.1. Primitive Language Analysis or Heuristic Content Filtering ... 51
   2.2. White and Black Listings ... 51
   2.3. Graylisting ... 52
   2.4. Digital Signatures and Reputation Control ... 53
   2.5. Postage ... 54
   2.6. Disposable Addresses ... 54
   2.7. Collaborative Filtering ... 55
   2.8. Honeypotting and Email Traps ... 55
   2.9. Content-Based Filters ... 56
3. Content-Based Spam Filtering ... 56
   3.1. Heuristic Filtering ... 57
   3.2. Learning-Based Filtering ... 63
   3.3. Filtering by Compression ... 80
   3.4. Comparison and Summary ... 83
4. Spam Filters Evaluation ... 83
   4.1. Test Collections ... 84
   4.2. Running Test Procedure ... 87
   4.3. Evaluation Metrics ... 88
5. Spam Filters in Practice ... 92
   5.1. Server Side Versus Client Side Filtering ... 93
   5.2. Quarantines ... 95
   5.3. Proxying and Tagging ... 96
   5.4. Best and Future Practical Spam Filtering ... 98
6. Attacking Spam Filters ... 98
   6.1. Introduction ... 98
   6.2. Indirect Attacks ... 99
   6.3. Direct Attacks ... 101
7. Conclusions and Future Trends ... 109
References ... 109

1. Introduction

1.1 What is Spam?

In the literature, we can find several terms for naming unsolicited emails. Junk emails, bulk emails, or unsolicited commercial emails (UCE) are a few of them, but the most common word used for reference is 'spam.' It is not clear where the word spam comes from, but many authors state that the term was taken from a Monty Python sketch, where a couple go into a restaurant and the wife tries to get something other than spam. In the background are a bunch of Vikings that sing the praises of spam: 'spam, spam, spam, spam . . . lovely spam, wonderful spam.' Pretty soon the only thing you can hear in the skit is the word 'spam.' That same idea would happen to the Internet if large-scale inappropriate postings were allowed. You could not pick the real postings out from the spam. But, what is the difference between spam and legitimate emails? We can consider an email as spam if it has the following features:

• Unsolicited: The receiver is not interested in receiving the information.
• Unknown sender: The receiver does not know and has no link with the sender.
• Massive: The email has been sent to a large number of addresses.

In the next subsections, we describe the most prominent issues regarding spam, including its effects, types, and main measures against it.

1.2 The Problem of Email Spam

The problem of email spam can be quantified in economical terms. Many hours are wasted every day by workers: it is not just the time they waste reading spam, but also the time they spend deleting those messages. Let us think of a corporate network of about 500 hosts, each one receiving about 10 spam messages every day. If 10 minutes are wasted because of these emails, we can easily estimate the large number of hours wasted just because of spam. Whether an employee receives dozens or just a few each day, reading and deleting these messages takes time, lowering work productivity. As an example, the United Nations Conference on Trade and Development estimates the global economic impact of spam could reach $20 billion in lost time and productivity. The California legislature found that spam cost United States organizations alone more than $10 billion in 2004, including lost productivity and the additional equipment, software, and


manpower needed to combat the problem. A report by Nucleus Research¹ in 2004 claims that spam will cost US employers $2K per employee in lost productivity. Nucleus found that unsolicited email reduced employee productivity by a staggering 1.4%. Spam-filtering solutions have been doing little to control this situation, reducing spam levels by only 26% on average, according to some reports. There are also technical problems caused by spam. Quite often spam can be dangerous, containing viruses, trojans, or other kinds of damaging software, opening security breaches in computers and networks. In fact, it has been demonstrated that virus writers hire spammers to disseminate their so-called malware. Spam has been the main means to perform 'phishing' attacks, in which a bank or another organization is supplanted in order to get valid credentials from the user and steal his banking data, leading to fraud. Also, network and email administrators have to employ substantial time and effort in deploying systems to fight spam. As a final remark, spam is not only dangerous or a waste of time, but it can also be quite disturbing. Receiving unsolicited messages is a privacy violation, and often forces the user to see strongly unwanted material, including pornography. There is no way to quantify this damage in terms of money, but no doubt it is far from negligible.

1.3 Spam Families

In this subsection, we describe some popular spam families or genres, focusing on those we have found most popular or damaging.

1.3.1 Internet Hoaxes and Chain Letters

There are a whole host of annoying hoaxes that circulate by email and encourage you to pass them on to other people. Most hoaxes have a similar pattern. These are some common examples to illustrate the language used:

• Warnings about the latest nonexistent virus dressed up with impressive but nonsensical technical language such as 'nth-complexity binary loops.'
• Emails asking to send emails to a 7-year-old boy dying of cancer, or promises that one well-known IT company's CEO or president will donate money to charity for each email forwarded.
• Messages concerning the Helius Project, about a nonexistent alien being communicating with people on Earth, launched in 2003 and still online. Many people who interacted with Helius argue that Helius is real.

¹ See http://www.nucleusresearch.com for more information.


In general, messages that say 'forward this to everyone you know!' are usually hoaxes or chain letters. The purpose of these letters ranges from joking to generating significant amounts of network traffic that involve economic losses at ISPs.

1.3.2 Pyramid Schemes

This is a common attempt to get money from people. Pyramid schemes are worded along the lines of 'send ten dollars to this address and add yourself to the bottom of the list. In six weeks you'll be a millionaire!' They do not work (except for the one at the top of the pyramid, of course). They are usually illegal and you will not make any money from them.

1.3.3 Advance Fee Fraud

Advance fee fraud, also known as Nigerian fraud or 419 fraud, is a particularly dangerous kind of spam. It takes the form of an email claiming to be from a businessman or government official, normally in a West African state, who supposedly has millions of dollars obtained from the corrupt regime and would like your help in getting it out of the country. In return for depositing the money in your bank account, you are promised a large chunk of it. The basic trick is that after you reply and start talking to the fraudsters, they eventually ask you for a large sum of money up front in order to get an even larger sum later. You pay, they disappear, and you lose.

1.3.4 Commercial Spam

This is the most common family of spam messages. They are commercial advertisements trying to sell a product (that usually cannot be bought in a regular store). According to a report by Sophos about security threats in 2006,² health- and medical-related spam (which primarily covers medication claiming to assist in sexual performance, weight loss, or human growth hormones) remained the most dominant type of spam and rose during the year 2006. In the report we find the top categories in commercial spam:

• Medication/pills – Viagra, Cialis, and other sexual medication.
• Phishing scams – Messages supplanting Internet and banking corporations like eBay, PayPal, or the Bank of America, in order to get valid credentials and steal users' money.
• Non-English language – An increasing amount of commercial spam is translated or specifically prepared for non-English communities.

² See the full report at http://www.sophos.com.

• Software – Or how you can get it very cheap, as it is OEM (Original Equipment Manufacturer), that is, prepared to be served within a brand new PC at the store.
• Mortgage – A popular category, including not only mortgages but especially debt grouping.
• Pornography – One of the most successful businesses on the net.
• Stock scams – Interestingly, it has been observed that promoting stock corporations via email has had some impact on their results.

The economics behind the spam problem are clear: if users did not buy products marketed through spam, it would not be such a good business. If you are able to send 10 million messages advertising a 10-dollar product, and you get just one sale per ten thousand messages, you will be getting 10 thousand dollars from your spam campaign. Some spammers have been reported earning around five hundred thousand dollars a month, for years.

1.4 Legal Measures Against Spam

Fighting spam requires uniform international laws, as the Internet is a global network and only uniform global legislation can combat spam. A number of nations have implemented legal measures against spam. The United States of America has both a federal law against spam and a separate law for each state. Something similar can be found in Europe: the European Union has its antispam law, but most European countries have their own spam laws too. There are especially effective or very strict antispam laws like those in Australia, Japan, and South Korea. There are also bilateral treaties on spam and Internet fraud, such as those between the United States and Mexico or Spain. On the other side, there are also countries without specific regulation on spam, so there it is an activity that is not considered illegal. With this scenario, it is very difficult to apply legal measures against spammers. Besides that, anonymity is one of the biggest advantages of spammers. Spammers frequently use false names, addresses, phone numbers, and other contact information to set up 'disposable' accounts at various Internet service providers. In some cases, they have used falsified or stolen credit card numbers to pay for these accounts. This allows them to quickly move from one account to the next as each one is discovered and shut down by the host ISPs. While some spammers have been caught (a notable case is that of Jeremy Jaynes), there are many spammers that have avoided capture for years. A trusted spammers' hall of fame is maintained by The Spamhaus Project, and it is known as the Register Of Known Spam Operations (ROKSO).³

³ The ROKSO can be accessed at http://www.spamhaus.org/rokso/index.lasso.

2. Technical Measures

Among all the different techniques used for fighting spam, technical measures have become the most effective. There are several approaches used to filter spam. In the next subsections, we comment on some of the most popular ones.

2.1 Primitive Language Analysis or Heuristic Content Filtering

The very first spam filters used primitive language analysis techniques to detect junk email. The idea was to match specific texts or words against the email body or the sender address. In the mid 1990s, when spam was not the problem that it is today, users could filter unsolicited emails by scanning them, searching for phrases or words that were indicative of spam like 'Viagra' or 'Buy now.' In those days spam messages were not as sophisticated as they are today, and this very simplistic approach could filter ~80% of the spam. The first versions of the most important email clients included this technique, which worked quite well for a time, before spammers started to use their tricks to avoid filters. The way they obfuscated messages made this technique ineffective. Another weakness of this approach was the high false-positive rate: any message containing 'forbidden words' was sent to the trash. Most of those words were good for filtering spam, but sometimes they could appear in legitimate emails. This approach is not used nowadays because of its low accuracy and high error rates. This primitive analysis technique is in fact a form of content analysis, as it makes use of every email's content to decide whether it is spam or legitimate. We have called this technique heuristic filtering, and it is extensively discussed below.

2.2 White and Black Listings

White and black lists are extremely popular approaches to filter spam email [41]. White lists state which senders' messages are never considered spam, and black lists include those senders that should always be considered spammers. A white list contains addresses that are supposed to be safe. These addresses can be individual emails, domain names, or IP addresses, and they can filter an individual sender or a group of them. This technique can be used on the server side and/or on the client side, and is usually found as a complement to other, more effective approaches. In server-side white lists, an administrator has to validate the addresses before they go to the trusted list. This can be feasible in a small company or a server with a small number of email accounts, but it can turn into a pain if applied to large corporate servers with every user having his own white list. This is because the


task of validating each email that is not in the list is a time-consuming job. An extreme use of this technique is to reject all emails coming from senders that are not in the white list. This may sound very unreasonable, but it is not. It can be used in restricted domains like schools, where you prefer to filter emails from unknown senders and keep the children away from potentially harmful content, because spam messages could contain porn or other kinds of adult content. In fact, this aggressive antispam technique has been used by some free email service providers such as Hotmail, in which a rule can be stated preventing any message coming from any other service from getting into their users' mailboxes. White listings can also be used on the client side. In fact, one of the first techniques used to filter spam consisted of using the user's address book as a white list, tagging as potential spam all those emails that had, in the FROM: field, an address that was not in the address book. This technique can be effective for those persons who use email just to communicate with a limited group of contacts like family and friends. The main problem of white listings is the assumption that trusted contacts do not send junk email and, as we are going to see, this assumption can be erroneous. Many spammers use computers that have been compromised with trojans and viruses for sending spam, sending it to all the contacts of the address book, so we could get a spam message from a known sender if his computer has been infected with a virus. Since these contacts are in the white list, all messages coming from them are flagged as safe. Black listings, most often known as DNS Blacklists (DNSBL), are used to filter out emails which are sent by known spam addresses or compromised servers. The very first and most popular black list has been the trademarked Realtime Blackhole List (RBL), operated by the Mail Abuse Prevention System. System administrators, using spam detection tools, report IP addresses of machines sending spam, and these are stored in a common central list that can be shared by other email filters. Most antispam software has some form of access to networked resources of this kind. Aggressive black listings may block whole domains or ISPs, producing many false positives. A way to deal with this problem is to have several distributed black listings and contrast the sender's information against some of them before blocking an email. Current DNS black lists are dynamic, that is, they not only grow with new information, but also expire entries, maintaining a fresh reflection of the current situation in the address space.
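A minimal sketch of how such listings are applied, assuming made-up addresses and locally stored lists (real DNSBLs are queried over DNS rather than held in local sets):

```python
# Illustrative white/black listing check; addresses and lists are invented.
WHITELIST = {"boss@mycompany.example", "mycompany.example"}
BLACKLIST = {"spam-farm.example", "203.0.113.66"}

def listing_decision(sender: str, sender_ip: str) -> str:
    domain = sender.split("@")[-1]
    if sender in WHITELIST or domain in WHITELIST:
        return "accept"          # trusted senders bypass further filtering
    if domain in BLACKLIST or sender_ip in BLACKLIST:
        return "reject"          # known spam sources are dropped outright
    return "analyze"             # unknown senders go to content-based filtering

print(listing_decision("someone@spam-farm.example", "198.51.100.7"))  # -> reject
```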

2.3 Graylisting

As a complement to white and black listings, one could use graylisting [94]. The core idea behind this approach is the assumption that junk email is sent using spam bots, that is, specific software made to send thousands of emails in a short time. This software differs from traditional email servers and does not respect the email RFC standards. In particular, emails that fail to reach their target are not sent again, as a real system would


do. This is precisely the feature exploited by graylisting. When the system receives an email from an unknown sender that is not in a white listing, it creates a sender–receiver tuple. The first time that tuple occurs in the system, the email is rejected, so it is bounced back to the sender. A real server will send that email again, so the second time the system finds the tuple, the email is flagged as safe and delivered to the recipient. Graylisting has some limitations and problems. The obvious ones are the delay we could have in getting some legitimate emails, because we have to wait until the email is sent twice, and the waste of bandwidth produced in the send–reject–resend process. Other limitations are that this approach will not work when spam is sent from open relays, as they are real email servers, and the easy way for spammers to work around graylisting: just adding new functionality to the software, allowing it to send bounced emails again.
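The behavior described above can be sketched as follows (a simplified, in-memory version; production graylisting implementations also expire entries and usually key on the sending IP as well):

```python
# Simplified graylisting: temporarily reject the first delivery attempt
# for an unseen (sender, recipient) tuple; accept retries.
seen_tuples = set()

def graylist(sender: str, recipient: str) -> str:
    key = (sender, recipient)
    if key in seen_tuples:
        return "accept"           # a real MTA retried, as RFC-compliant servers do
    seen_tuples.add(key)
    return "tempfail (451)"       # spam bots typically never retry

print(graylist("new@example.org", "me@mydomain.example"))  # -> tempfail (451)
print(graylist("new@example.org", "me@mydomain.example"))  # -> accept
```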

2.4 Digital Signatures and Reputation Control

With the emergence of Public Key Cryptography, and specifically its application to email encryption and signing, most prominently represented by Pretty Good Privacy (PGP) [103] and GNU Privacy Guard (GPG) [64], there exists the possibility of filtering out unsigned messages and, in case they are signed, those sent by untrusted users. PGP allows keeping a nontrivial web/chain of trust between email users, in which trust is spread over a net of contacts. This way, a user can trust the signer of a message if he or she is trusted by another contact of the email receiver. The main disadvantage of this approach is that PGP/GPG users are rare, so it is quite risky to consider legitimate only the email coming from trusted contacts. However, it is possible to extend this idea to email servers. As we saw in the previous section, many email filters use white listings to store safe senders, usually local addresses and addresses of friends. So if spammers figure out who our friends are, they can forge the FROM: header of the message with that information, avoiding filters, because senders that are in white listings are never filtered out. Sender Policy Framework (SPF), DomainKeys, and Sender ID try to prevent forgery by registering, for every valid email sender, the IPs of the machines used to send its email [89]. So if someone is sending an email from a particular domain but it does not match the IP address of the sender, you can know the email has been forged. The messages are signed by the public key of the server, which makes its SPF, DomainKeys, or Sender ID record public. As more and more email service providers (especially the free ones, like Yahoo!, Hotmail, or Gmail) are making their IP records public, the approach will be increasingly effective. Signatures are a basic implementation of a more sophisticated technique, which is reputation control for email senders. When the system receives an email from an unknown sender, the message is scanned and classified as legitimate or spam. If the


email is classified as legitimate, the reputation of the sender is increased, and decreased if classified as spam. The more emails are sent from that address, the more positively or negatively the sender is ranked. Once the reputation crosses a certain threshold, the sender can be moved to a white or black list. The approach can be extended to the whole IP space in the net, as featured in current antispam products by IronPort under the name SenderBase.⁴

2.5 Postage

One of the main reasons for spam's success is the low cost of sending it. Senders do not have to pay for sending email, and bandwidth costs are very low even when sending millions of emails.⁵ Postage is a technique based upon the principle of senders of unsolicited messages demonstrating their goodwill by paying some kind of postage: either a small amount of money paid electronically, a sacrifice of a few seconds of human time answering a simple question, or some computation time on the sender machine. As the email services are based on the Simple Mail Transfer Protocol, economic postage requires a specific architecture over the net, or a dramatic change in the email protocol. Abadi et al. [1] describe a ticket-based client–server architecture to provide postage for avoiding spamming (yet other applications are suitable). An alternative is to require the sender to answer some kind of question, to prove he is actually a human being. This is the kind of Turing Test [93] that has been implemented in many web-based services, requiring the user to type the hidden word in a picture. These tests are named CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) [2]. This approach can be used especially effectively to avoid outgoing spam, that is, preventing spammers from abusing free email service providers such as Hotmail or Gmail. A third approach is requiring the sender machine to solve some kind of computationally expensive problem [32], producing a delay and thus preventing spammers from sending millions of messages per day. This approach is, by far, the least annoying of the postage techniques proposed and, thus, the most popular one.
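The computational postage idea can be illustrated with a hashcash-style puzzle (a simplified sketch of the general technique; concrete proposals such as [32] differ in the details):

```python
import hashlib

def mint_stamp(message_id: str, bits: int = 20) -> int:
    """Find a nonce whose SHA-1 with the message id has `bits` leading zero bits."""
    target = 1 << (160 - bits)        # SHA-1 digests are 160 bits long
    nonce = 0
    while True:
        digest = hashlib.sha1(f"{message_id}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce              # costly to find, trivial for the receiver to verify
        nonce += 1

# Each message costs the sender about 2^bits hash evaluations: negligible
# for one email, prohibitive for millions of them.
print(mint_stamp("msg-42@example.org", bits=12))
```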

⁴ The level of trust of an IP in SenderBase can be checked at: http://www.senderbase.org/.
⁵ Although the case against the famous spammer Jeremy Jaynes has revealed he had been spending more than one hundred thousand dollars per month on high-speed connections.

2.6 Disposable Addresses

Disposable addresses [86] are a technique used to prevent a user from receiving spam. It is not a filtering system itself but a way to keep spammers from finding out our address. To harvest email addresses, spammers crawl the web searching for addresses in web pages, forums, or Usenet groups. If we do not publish our address on the Internet, we can be more or less protected against spam, but the problem arises when we want to register on a web page or an Internet service and we have to fill in our email address in a form. Most sites state that they will not use that information for sending spam, but we cannot be sure, and many times the address goes to a list that is sold to third-party companies and used for sending commercial emails. Moreover, these sites can be accessed by hackers with the purpose of collecting valid (and valuable) addresses. To circumvent this problem, we can use disposable email addresses. Instead of letting the user prepare his own disposable addresses, he can be provided with an automatic system to manage them, like the channels' infrastructure by ATT [50]. The addresses are temporary accounts that the user can use to register in web services. All messages sent to a disposable address are redirected to our permanent safe account during a configurable period of time. Once the temporary address is no longer needed, it is deleted, so even if that account receives spam, it is not redirected.

2.7 Collaborative Filtering

Collaborative filtering [48] is a distributed approach to filter spam. Instead of each user having his own filter, a whole community works together. Using this technique, each user shares his judgments of what is spam and what is not with the other users. Collaborative filtering networks take advantage of the spam received by some users to build better filters for those who have not yet received those spam messages. When a group of users in the same domain have tagged an email coming from a common sender as spam, the system can use the information in those emails to learn to classify those particular emails, so the rest of the users in the domain will not receive them. The weakness of this approach is that what is spam for somebody could be legitimate content for another. These collaborative spam filters cannot be as accurate as a personal filter on the client side, but they are an excellent option for filtering on the server side. Another disadvantage of this approach is that spammers introduce small variations in the messages, preventing the identification of a new incoming spam email as a close variation of one received earlier by another user [58].
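A sketch of the sharing mechanism (illustrative only; deployed systems use fuzzy digests precisely to resist the small variations mentioned above):

```python
import hashlib
from collections import Counter

reports = Counter()  # shared digest -> number of users who flagged it as spam

def digest(message: str) -> str:
    # Naive exact digest; small spammer-introduced variations defeat it,
    # which is why real systems use fuzzy or locality-sensitive hashes.
    return hashlib.sha256(" ".join(message.lower().split()).encode()).hexdigest()

def report_spam(message: str) -> None:
    reports[digest(message)] += 1

def is_known_spam(message: str, threshold: int = 3) -> bool:
    return reports[digest(message)] >= threshold

for _ in range(3):
    report_spam("Buy cheap meds NOW")
print(is_known_spam("buy   cheap meds now"))  # -> True (same normalized digest)
```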

2.8 Honeypotting and Email Traps

Spammers are known to abuse vulnerable systems like open mail relays and public open proxies. In order to discover spam activities, some administrators have created honeypot programs that simulate being such vulnerable systems. The existence of such fake systems makes it more risky for spammers to use open relays


and open proxies for sending spam. Honeypotting is a common technique used by system administrators to detect hacking activities on their servers. They create fake vulnerable servers in order to bust hackers while protecting the real servers. Since the term honeypotting is more appropriate for security environments, the terms 'email traps' or 'spam traps' can be used instead when referring to these techniques as applied to spam prevention. Spam traps can be used to collect instances of spam messages, keeping a fresh collection of spammer techniques (and a better training collection for learning-based classifiers), to build and deploy updated filtering rules in heuristic filters, and to detect new spam attacks in advance, preventing them from reaching, for example, a particular corporate network.

2.9 Content-Based Filters

Content-based filters are based on analyzing the contents of emails. These filters can be hand-made rules, also known as heuristic filters, or learned using Machine Learning algorithms. Both approaches are widely used these days in spam filters because they can be very accurate when correctly tuned, and they are analyzed in depth in the next section.

3. Content-Based Spam Filtering

Among the technical measures to control spam, content-based filtering is one of the most popular ones. Spam filters that analyze the contents of the messages and take decisions on that basis have spread among Internet users, ranging from individual users at their home personal computers to big corporate networks. The success of content-based filters is so great that spammers have performed increasingly complex attacks designed to avoid them and to reach the users' mailboxes. This section covers the most relevant techniques for content-based spam filtering. Heuristic filtering is important for historical reasons, although the most popular modern heuristic filters have some learning component. Learning-based filtering is the main trend in the field; the ability to learn from examples of spam and legitimate messages gives these filters full power to detect spam in a personalized way. Recent TREC [19] competitive evaluations have stressed the importance of a family of learning-based filters, which are those using compression algorithms; they have scored top in terms of effectiveness, and so they deserve a specific section.

3.1 Heuristic Filtering

Since the very first spam messages, users (who were simultaneously their own 'network administrators') have coded rules or heuristics to separate spam from their legitimate messages, and avoid reading the former [24]. A content-based heuristic filter is a set of hand-coded rules that analyze the contents of a message and classify it as spam or legitimate. For instance, a rule may look like:

if ('Viagra' ∈ M) or ('VIAGRA' ∈ M) then class(M) = spam

This rule means that if either of the words 'Viagra' or 'VIAGRA' (which are in fact distinct character strings) occurs in a message M, then it should be classified as spam. While the first Internet users were often privileged user administrators and used this kind of rules in the context of sophisticated script and command languages, most modern mail user clients allow writing this kind of rules through simple forms. For instance, a straightforward Thunderbird spam filter is shown in Fig. 1. In this example, the user has prepared a filter named 'spam' that deletes all messages in which the word '**spam**' occurs in the Subject header.

FIG. 1. A simple spam filter coded as a Thunderbird mail client rule. If the word ‘**spam**’ occurs in the Subject of a message, it will be deleted (sent to trash).
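The hand-coded rule above can also be rendered directly in code; the following Python sketch (illustrative only) is a literal translation:

def classify(message_body):
    # Hand-coded heuristic: two distinct character strings are tested.
    if ('Viagra' in message_body) or ('VIAGRA' in message_body):
        return 'spam'
    return 'legitimate'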


However, these filtering utilities are most often used to classify the incoming messages into folders, according to their sender, their topic, or the mail list they belong to. They can also be used in conjunction with a filtering solution outside the mail client, which may tag spam messages (for instance, with the string '**spam**' in the subject, or, more sophisticatedly, by adding a specific header like, for example, X-mail report, which can include a simple tag or a rather informative output with even a score) that will be later processed by the mail client, applying the suitable filters and performing the desired action. A sensible action is to send the messages tagged as spam to a quarantine folder in order to mitigate the effect of false positives (legitimate messages classified as spam).

It should be clear that maintaining an effective set of rules can be a rather time-consuming job. Spam messages include offers of pharmacy products, porn advertisements, unwanted loans, stock recommendations, and many other types of messages. Not only their content, but also their style is always changing. In fact, it is hard to find a message in which the word 'Viagra' occurs without alterations (except for a legitimate one!). In other words, there are quite a few technical experts highly committed to making the filter fail: the spammers. This is why spam filtering is considered a problem of 'adversarial classification' [25].

Nowadays, neither a single network administrator nor even an advanced user is likely to write his or her own handmade rules to filter spam. Instead, a list of useful rules can be maintained by a community of expert users and administrators, as has been done in the very popular open-source solution SpamAssassin, or in the commercial service Brightmail provided by Symantec Corporation. We discuss these relevant examples in the next subsections, finishing this section with the advantages and disadvantages of this approach.

3.1.1 The SpamAssassin Filter

While it is easy to defeat a single network administrator, it is harder to defeat a community. This is the spirit of one of the most widespread heuristic filtering solutions: SpamAssassin [87]. This filter has received a number of prizes and, as a matter of example, it had more than 3600 downloads in the 19 months the project was hosted at Sourceforge6 (February 2002–September 2003). SpamAssassin is one of the oldest still-alive filters in the market, and its main feature (for the purpose of our presentation) is its impressive set of rules or heuristics, contributed by tens of administrators and validated by the project committee.

6 Sourceforge is the leading hosting server for open-source projects, providing versioning and downloading services, statistics, and more. See: http://sourceforge.net.


The current (version 3.2) set of rules (named 'tests' in SpamAssassin) has 746 tests.7 Some of them are administrative, and a number of them are not truly 'content-based,' as they, for example, check the sender address or IP against public white lists. For instance, the test named 'RCVD_IN_DNSWL_HI' checks if the sender is listed in the DNS Whitelist.8 Of course, this is a white-listing mechanism, and it makes nearly no analysis of the message content itself. On the other side, the rule named 'FS_LOW_RATES' tests if the Subject field contains the words 'low rates,' which are very popular in spam messages dealing with loans or mortgages. Many SpamAssassin tests address typing variability by using quite sophisticated regular expressions. We show a list of additional examples in Fig. 2, as they are presented in the project web page. A typical SpamAssassin content-matching rule has the structure shown in the next example:

body DEMONSTRATION_RULE /test/
score DEMONSTRATION_RULE 0.1
describe DEMONSTRATION_RULE This is a simple test rule

The rule starts with a line that describes the test to be performed, goes on with a line presenting the score, and has a final line for the rule description. The sample rule name is 'DEMONSTRATION_RULE,' and it checks the (case-sensitive) occurrence of the word 'test' in the body section of an incoming email message.

FIG. 2. A sample of SpamAssassin tests or filtering rules. The area tested may be the header, the body, etc., and each test is provided with one or more scores that can be used to set a suitable threshold and vary the filter sensitivity.

7 The lists of tests used by SpamAssassin are available at: http://spamassassin.apache.org/tests.html.
8 The DNS Whitelist is available at: http://www.dnswl.org/.


If the condition is satisfied, that is, the word occurs, then the score 0.1 is added to the message's global score. The score of the message may be incremented by other rules, and the message will be tagged as spam if the global score exceeds a manually or automatically set threshold. Of course, the higher the score of a rule, the more it contributes to the decision of tagging a message as spam. The tests performed in the rules can address all the parts of a message, and may request preprocessing or not. For instance, if the rule starts with 'header,' only the headers will be tested:

header DEMONSTRATION_SUBJECT Subject =~ /test/

In fact, the symbols '=~' preceding the test, along with the word 'Subject,' mean that only the Subject header will be tested. In this case, the subject field name is case insensitive. The tests allow complex expressions written in the Perl Regular Expressions (Regex) syntax. A slightly more complex example may be:

header DEMONSTRATION_SUBJECT Subject =~ /\btest\b/i

In this example, the expression '/\btest\b/i' means that the word 'test' will be searched for as a single word (and not as a part of others, like 'testing'), because it starts and finishes with the word-break mark '\b,' and the test will be case insensitive because of the finishing mark '/i.' Of course, regular expressions may be much more complex, but covering them in detail is beyond the scope of this chapter. We suggest [54] to the interested reader.

Manually assigning scores to the rules is not a very good idea, as the rule coder must have a precise and global idea of all the scores in all the rest of the rules. Instead, an automated method is required, which should be able to look at all the scores and a set of testing messages, and compute the scores that minimize the error of the filter. In versions 2.x, the scores of the rules were assigned using a Genetic Algorithm, while in (current) versions 3.x, the scores are assigned using a neural network trained with error back propagation (a perceptron). Both systems attempt to optimize the effectiveness of the rules in terms of minimizing the number of false positives and false negatives, and they are presented in [87] and [91], respectively.

The scores are optimized on a set of real examples contributed by volunteers. The SpamAssassin group has in fact released a corpus of spam messages, the so-named SpamAssassin Public Corpus.9 This corpus includes 6047 messages, with a spam ratio of about 31%. As it has been extensively used for the evaluation of content-based spam filters, we leave a more detailed description of it for Section 4.

9 The SpamAssassin Public Corpus is available at: http://spamassassin.apache.org/publiccorpus/.
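To make the scoring mechanism concrete, the following sketch mimics the scheme described above: each matching regex rule adds its score, and the message is tagged as spam when the total exceeds a threshold. The rules, scores, and threshold here are invented for illustration and are not the actual SpamAssassin values.

import re

# (pattern, score) pairs in the spirit of the rules discussed above.
RULES = [
    (re.compile(r'\btest\b', re.IGNORECASE), 0.1),   # cf. DEMONSTRATION_SUBJECT
    (re.compile(r'low rates', re.IGNORECASE), 1.5),  # cf. FS_LOW_RATES
]
THRESHOLD = 5.0  # manually or automatically set

def global_score(message_body):
    # Every satisfied rule adds its score to the message's global score.
    return sum(score for pattern, score in RULES if pattern.search(message_body))

def is_spam(message_body):
    return global_score(message_body) >= THRESHOLD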


3.1.2 The Symantec Brightmail Solution

Some years ago, Brightmail emerged as an antispam solution provider based on an original business model in the antispam market. Instead of providing a final software application with filtering capabilities, it focused more on the service, and took the operation model from antivirus corporations: analysts working 24 hours a day on new attacks, and frequent delivery of new protection rules. The model succeeded, and on June 21st, 2004, Symantec Corporation acquired Brightmail Incorporated, with its solution and its customers. Nowadays, Symantec claims that Brightmail Anti-spam protects more than 300 million users, and filters over 15% of worldwide email. The Brightmail Anti-spam solution works at the clients' gateway, scanning messages incoming to the corporation, and deciding whether they are spam or not. The decision is taken on the basis of a set of filtering rules provided by the experts working in the Symantec operations center, named BLOC (Brightmail Logistics Operations Center). The operational structure of the solution is shown in Fig. 3, whose numbered circles denote the following processing steps:


FIG. 3. Operational structure of the Symantec Brightmail Anti-spam solution. The numbers in circles denote the processes described in the text.

1. The Probe Network(TM) is a network of fake email boxes ('spam traps' or 'honeypots') that have been seeded with the only purpose of receiving spam. These email addresses can be collected only by automatic means, as spammers do, and in consequence, they can receive only spam emails. The spam collected is sent to the BLOC.
2. At the BLOC, groups of experts and, more recently, automatic content-based analysis tools build, validate, and distribute antispam filters to Symantec corporate customers.
3. Every 10 minutes, the Symantec software downloads the most recent version of the filters from the BLOC, in order to keep them as updated as possible.
4. The software filters the incoming messages using the Symantec and the user-customized filters.
5. The email administrator determines the most suitable administration mode for the filtered email. Most often, the email detected as spam is kept in a quarantine where users can check if the filter has mistakenly classified a legitimate message as spam (a false positive).
6. The users can report undetected spam to Symantec for further analysis.

The processes at the BLOC were once manual, but the ever-increasing number of spam attacks has progressively made it impossible to approach filter building as a hand-made task. In recent times, spam experts in the BLOC have actually switched their role to filter adapters and tuners, as the filters are being produced using the automatic, learning-based tools described in the next section.

3.1.3 Problems in Heuristic Filtering

Heuristic content-based filtering has clear advantages over other kinds of filtering, especially those based on black and white listing. The most remarkable one is that it filters not only on the 'From' header; it can make use of the entire message and, in consequence, make a more informed decision. Furthermore, it offers a lot of control over the message information that is scanned, as the filter programmer decides which areas to scan and what to seek. However, heuristic filtering has two noticeable drawbacks. First, writing rules is not an easy task. Either it has to be left in the hands of an experienced email administrator, or it has to be simplified via the forms of commercial mail clients as described above. The first case usually involves some programming, probably including a bit of regular expression definition, which is hard and error-prone. The second one implies a sacrifice of flexibility to gain simplicity. The second drawback is that, even when the rules are written by a community of advanced users or administrators, the number of spammers is bigger and, moreover, they have a strong economic motivation to design new methods to avoid detection. In this arms race, the spammers will always have the winning hand if the work of administrators is not supported by automatic (learning-based) tools such as those we describe in the next section.

3.2 Learning-Based Filtering

During the past 9 years, a new paradigm of content-based spam filtering has emerged. Bayesian filters or, more generally, learning-based filters, have the ability to learn from the email flow and to improve their performance over time, as they can adapt themselves to the actual spam and legitimate email a particular user receives. Their impressive success is demonstrated by the deep impact they have had on spam email, as spammers have had to change their techniques, at considerable cost, in order to avoid them. Learning-based filters are the current state of the art in email filtering, and the main focus of this chapter.

3.2.1 Spam Filtering as Text Categorization

Spam filtering is an instance of a more general text classification task named Text Categorization [85]. Text Categorization is the assignment of text documents to a set of predefined classes. It is important to note that the classes are preexistent, instead of being generated on the fly (which corresponds to the task of Text Clustering). The main application of text categorization is the assignment of subject classes to text documents. Subject classes can be web directory categories (like in Yahoo!10), thematic descriptors in libraries (like the Library of Congress Subject Headings11 or, in particular domains, the Medical Subject Headings12 by the National Library of Medicine, or the Association for Computing Machinery's Computing Classification System descriptors used in the ACM Digital Library itself), personal email folders, etc. The documents may be Web sites or pages, books, scientific articles, news items, email messages, etc.

Text Categorization can be done manually or automatically. The first method is the one used in libraries, where expert catalogers scan new books and journals in order to get them indexed according to the classification system used. For instance, the National Library of Medicine employs around 95 full-time cataloguers in order to index the scientific and news articles distributed via MEDLINE.13 Obviously this is a time- and money-consuming task, and the number of publications is always increasing.

The increasing availability of tagged data has allowed the application of Machine Learning methods to the task. Instead of hand-classifying the documents, or manually building a classification system (such as what we have named a heuristic filter

10 Available at: http://www.yahoo.com/.
11 Available at: http://www.loc.gov/cds/lcsh.html.
12 Available at: http://www.nlm.nih.gov/mesh/.
13 James Marcetich, Head of the Cataloguing Section of the National Library of Medicine, in personal communication (July 18, 2001).


above), it is possible to automatically build a classifier by using a Machine Learning algorithm on a collection of hand-classified documents suitably represented. The administrator or the expert does not have to write the filter, but lets the algorithm learn the document properties that make documents suitable for each class. This way, the traditional expert system 'knowledge acquisition bottleneck' is alleviated, as the experts can keep on doing what they do best (that is, in fact, classifying), and the system will be learning from their decisions. The Machine Learning approach has achieved considerable success in a number of tasks, and in particular, in spam filtering. In the words of Sebastiani:

(Automated Text Categorization) has reached effectiveness levels comparable to those of trained professionals. The effectiveness of manual Text Categorization is not 100% anyway (...) and, more importantly, it is unlikely to be improved substantially by the progress of research. The levels of effectiveness of automated TC are instead growing at a steady pace, and even if they will likely reach a plateau well below the 100% level, this plateau will probably be higher than the effectiveness levels of manual Text Categorization. [85] (p. 47)

Spam filtering can be considered an instance of (Automated) Text Categorization, in which the documents to classify are the user's email messages, and the classes are spam and legitimate email. It may be considered easy, as it is a single-class problem, instead of the many classes that are usually considered in a thematic TC task.14 However, it shows special properties that make it a very difficult task:

1. Both the spam class and the complementary class (legitimate email) are not thematic, that is, they can contain messages dealing with several topics or themes. For instance, as of 1999, 37% of spam email consisted of 'get rich quick' letters, 25% of pornographic advertisements, and 18% of software offers. The rest of the spam included Web site promos, investment offers, (fake) health products, contests, holidays, and others. Moreover, some of the spam types can overlap with legitimate messages, both commercial ones and those coming from distribution lists. While the percentages have certainly changed (health and investment offers are now most popular), this demonstrates that current TC systems that rely on words and features for classification may have important problems, because the spam class is very fragmented.
2. Spam has an always changing and often skewed distribution. For instance, according to the email security corporation MessageLabs [66], spam has gone from 76.1% of the email sent in the first quarter of 2005, to 56.9% in the first quarter of 2006, and back to 73.5% in the last quarter of 2006. On one side,

14 The Library of Congress Subject Headings 2007 edition has over 280,000 total headings and references.


Machine Learning classifiers expect the same class distribution they learnt from; any variation of the distribution may affect the classifier's performance. On the other side, skewed distributions like 90% spam (reached in 2004) may make a learning algorithm produce a trivial acceptor, that is, a classifier that always classifies a message as spam. This is due to the fact that Machine Learning algorithms try to minimize the error or maximize the accuracy, and the trivial acceptor is then 90% accurate. And even worse, the spam rate can vary from place to place, from company to company, and from person to person; in that situation, it is very difficult to build a fit-them-all classifier.
3. Like many other classification tasks, spam filtering has imbalanced misclassification costs. In other words, the kinds of mistakes the filter makes are significant. No user will be happy with a filter that catches 99% of spam but deletes a legitimate message once a day. This is because false positives (legitimate messages classified as spam) are far more costly than false negatives (spam messages classified as legitimate, and thus reaching the user's inbox). But again, it is not clear which proportion is right: a user may accept a filter that makes a false positive per 100 false negatives, or per 1,000, etc. It depends on the user's taste, the amount of spam he or she receives, the place where he or she lives, the kind of email account (work or personal), etc.
4. Perhaps the most difficult issue with spam classification is that it is an instance of adversarial classification [25]. An adversarial classification task is one in which there exists an adversary that modifies the data arriving at the classifier in order to make it fail. Spam filtering is perhaps the most representative instance of adversarial classification, among many others like computer intrusion detection, fraud detection, counter-terrorism, or a much related one: web spam detection. In this latter task, the system must detect which webmasters manipulate pages and links to inflate their rankings, after reverse engineering the ranking algorithm. The term spam, although coming from the email arena, is so widespread that it is being used for many other fraud problems: mobile spam, blog spam, 'spim' (spam over Instant Messaging), 'spit' (spam over Internet Telephony), etc. Regarding spam email filtering, standard classifiers like Naïve Bayes were initially successful [79], but spammers soon learned to fool them by inserting 'nonspam' words into emails, breaking up 'spam' ones like 'Viagra' with spurious punctuation, etc. Once spam filters were modified to detect these tricks, spammers started using new ones [34].

In our opinion, these issues make spam filtering a very unusual instance of Automated Text Categorization. That being said, we must note that the standard structure of an Automated Text Categorization system is suited to the problem of spam filtering, and so we will discuss this structure in the next subsections.


3.2.2 Structure of Processing

The structure of analysis and learning in modern content-based spam filters that make use of Machine Learning techniques is presented in Fig. 4. In this figure,15 we represent processes or functions as rounded boxes, and information items as plain boxes. The analysis, learning, and retraining (with feedback information) of the classifier are time- and memory-consuming processes, intended to be performed offline and periodically. The analysis and filtering of new messages must be a fast process, to be performed online, as soon as the message arrives at the system. We describe these processes in detail below.

FIG. 4. Processing structure of a content-based spam filter that uses Machine Learning algorithms.

15 That figure is a remake of that by Belkin and Croft for text retrieval in [7].


The first step in a learning-based filter is getting a collection of spam and legitimate messages and training a classifier on it. Of course, the collection of messages (the training collection) must represent the operating conditions of the filter as accurately as possible. Machine Learning classifiers often perform poorly when they have been trained on noisy, inaccurate, or insufficient data. There are some publicly available spam/legitimate email collections, discussed in Section 4, as they are most often used as test collections.

The first process in the filter is learning a classifier on the training messages, named instances or examples. This process involves analyzing the messages in order to get a suitable representation for learning from them. In this process, the messages are often represented as attribute-value vectors, in which attributes are word tokens in the messages, and values are, for example, binary (the word token occurs in the message or not). Next, a Machine Learning algorithm is fed with the represented examples, and it produces a classifier, that is, a model of the messages or a function able to classify new messages if they follow the suitable representation. Message representation and classifier learning are the key processes in a learning-based filter, and so they are discussed in detail in the next subsections.

Once the classifier has been trained, it is ready to filter new messages. As they arrive, they are processed in order to represent them according to the format used in the training messages. That typically involves, for instance, ignoring new words in the messages, as classification is made on the basis of known words in the training messages. The classifier receives the represented message and classifies it as spam or legitimate (probably with a probability or a score), and the message is tagged accordingly (or routed to the quarantine, the user's mailbox, or wherever appropriate).

The main strength of Machine Learning-based spam filters is their ability to learn from user relevance judgments, adapting the filter model to the actual email received by the user. When the filter commits a mistake (or, depending on the learning mode, after every message), the correct output is submitted to the filter, which stores it for further retraining. This ability is a very noticeable strength, because if every user receives different email, every filter is different (in terms of stored data and model learned), and it is very complex to prepare attacks able to avoid the filters of all users simultaneously. As spammers' benefit relies on the number of messages read, they are forced to prepare very sophisticated attacks able to break different vendors' filters with different learned models. As we discuss in Section 6, they sometimes succeed using increasingly complex techniques.
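The offline/online structure just described can be realized in a few lines; the sketch below uses scikit-learn (our choice for illustration, not something this chapter prescribes) with binary word-token attributes:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Offline: analyze the training messages and learn a classifier.
train_texts = ['cheap viagra now', 'meeting minutes attached', 'get free money now']
train_labels = ['spam', 'legitimate', 'spam']
vectorizer = CountVectorizer(binary=True)          # binary attribute values
X_train = vectorizer.fit_transform(train_texts)
classifier = BernoulliNB().fit(X_train, train_labels)

# Online: represent the incoming message with the training vocabulary
# (unseen words are simply ignored) and classify it.
X_new = vectorizer.transform(['free money for you'])
print(classifier.predict(X_new))                   # e.g., ['spam']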

3.2.3 Feature Engineering

Feature engineering is the process of deciding, given a set of training instances, which properties will be considered for learning from them. The properties are the features or attributes, and they can take different values; this way, every training instance (a legitimate or spam email message) is mapped into a vector in a multidimensional space, in which the dimensions are the features. As many Machine Learning algorithms are very slow or just unable to learn in highly dimensional spaces, it is often required to reduce the number of features used in the representation, performing attribute selection (determining a suitable subset of the original attributes) or attribute extraction (mapping the original set of features onto a new, reduced set). These tasks are also a part of the feature engineering process.

3.2.3.1 Tokens and Weights. In Text Categorization and spam filtering, the most often used features are the sequences of characters or strings that minimally convey some kind of meaning in a text, that is, the words [39, 85]. More generally, we speak of breaking a text into tokens, a process named tokenization. In fact, this just follows the traditional Information Retrieval Vector Space Model by Salton [81]. This model specifies that, for the purpose of retrieval, texts can be represented as term-weight vectors, in which the terms are (processed) words (our attributes), and weights are numeric values representing the importance of every word in every document.

At first, learning-based filters took relatively simple decisions in this sense, following what was the state of the art in thematic text categorization. The simplest definition of features is words, a word being any sequence of alphabetic characters, with any other symbol considered as a separator or blank. This is the approach followed in the seminal work in this field, by Sahami et al. [79]. This work has been improved by Androutsopoulos et al. [4–6], who make use of a lemmatizer (or stemmer), to map words into their roots, and a stoplist (a list of frequent words that should be ignored, as they bring more noise than meaning to thematic retrieval: pronouns, prepositions, etc.). The lemmatizer used is Morph, included in the text analysis package GATE,16 and the stoplist includes the 100 most frequent words in the British National Corpus.17

In the previous work, the features are binary, that is, the value is one if the token occurs in the message, and zero otherwise. There are several more possible definitions of the weights or values, traditionally coming from the Information Retrieval field. For instance, using the same kind of tokens or features, the authors of [39] and [40] make use of TF.IDF weights, defined as:

w_ij = tf_ij × log2(N / df_i)

16 The GATE package is available at: http://gate.ac.uk/.
17 The British National Corpus statistics are available at: http://www.natcorp.ox.ac.uk/.


where tf_ij is the number of times the i-th token occurs in the j-th message, N is the number of messages, and df_i is the number of messages in which the i-th token occurs. The TF (Term Frequency) part of the weight represents the importance of the token or term in the current document or message, while the second part, IDF (Inverse Document Frequency), gives an ad hoc idea of the importance of the token in the entire document collection. TF weights are also possible. Even relatively straightforward decisions, like lowercasing all words, can strongly affect the performance of a filter.

The second generation of learning filters has been much influenced by Graham's work [46, 47], who took advantage of the increasing power and speed of computers to ignore most preprocessing and simplifying decisions taken before. Graham makes use of a more complicated definition of a token:

1. Alphanumeric characters, dashes, apostrophes, exclamation points, and dollar signs are part of tokens, and everything else is a token separator.
2. Tokens that are all digits are ignored, along with HTML comments, not even considering them as token separators.
3. Case is preserved, and there is neither stemming nor stoplisting.
4. Periods and commas are constituents if they occur between two digits. This allows getting IP addresses and prices intact.
5. Price ranges like $20–$25 are mapped to two tokens, $20 and $25.
6. Tokens that occur within the To, From, Subject, and Return-Path lines, or within URLs, get marked accordingly. For example, 'foo' in the Subject line becomes 'Subject*foo.' (The asterisk could be any character you do not allow as a constituent.)

Graham obtained very good results on his personal email by using this token definition and an ad hoc version of a Bayesian learner; a rough rendering of his tokenizer is sketched below. This definition has inspired other more sophisticated works in the field, but it has also led spammers to focus on tokenization as one of the main vulnerabilities of learning-based filters. The current trend is just the opposite: making nearly no analysis of the text, considering any white-space-separated string as a token, and letting the system learn from a really big number of messages (tens of thousands instead of thousands). Even more, HTML is not decoded, and tokens may include HTML tags, attributes, and values.
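The following Python sketch approximates rules 1-4 and 6 above (an illustration only; Graham's actual tokenizer handles more cases, and HTML-comment stripping is omitted):

import re

# Alphanumerics, dashes, apostrophes, '!' and '$' are constituents; periods
# and commas are kept only when they sit between digits (IPs, prices).
TOKEN = re.compile(r"[A-Za-z0-9\-'!$]+(?:[.,][0-9][A-Za-z0-9\-'!$]*)*")

def tokenize(text, field=None):
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.isdigit():
            continue  # rule 2: all-digit tokens are ignored
        # rule 6: tokens from marked fields become, e.g., 'Subject*foo'
        tokens.append(field + '*' + tok if field else tok)
    return tokens

print(tokenize('FREE!!! Save $20 at 192.168.0.1'))
print(tokenize('foo bar', field='Subject'))  # ['Subject*foo', 'Subject*bar']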

3.2.3.2 Multi-word Features. Some researchers have investigated features spanning two or more tokens, seeking patterns like 'get rich,' 'free sex,' or 'OEM software.' Using statistical word phrases has not yielded very good results in Information Retrieval [82], even leading to decreases in effectiveness. However, they have been quite successful in spam filtering. Two important works in this line are those by Zdziarski [102] and by Yerazunis [100, 101].


Zdziarski first used case-sensitive words in his filter Dspam, and later added what he has called 'chained tokens.' These tokens are sequences of two adjacent words, and follow these additional rules:

- There are no chains between the message header and the message body.
- In the message header, there are no chains between individual headers.
- Words can be combined with nonword tokens.

Chained tokens are not a replacement for individual tokens, but rather a complement to be used in conjunction with them for better analysis. For example, if we are analyzing an email with the phrase 'CALL NOW, IT's FREE!,' there are four tokens created under standard analysis ('CALL,' 'NOW,' 'IT's,' and 'FREE!') and three more chained tokens: 'CALL NOW,' 'NOW IT's,' 'IT's FREE!.' Chained tokens are traditionally named word bigrams in the fields of Language Modeling and Information Retrieval.

In Table I, we can see how chained tokens may lead to better (more accurate) statistics. In this table, we show the probability of spam given the occurrence of a word, which is the conditional probability usually estimated with the following formula18:

P(spam|w) ≈ N(spam, w) / N(w)

where N(spam, w) is the number of times w occurs in spam messages, and N(w) is the number of times the word w occurs. Counting can also be done per message: the number of spam messages in which w occurs, and the number of messages in which w occurs.

Table I
SOME EXAMPLES OF CHAINED TOKENS ACCORDING TO [102]

Token words              P(spam|w1)   P(spam|w2)   P(spam|w1*w2)
w1=FONT, w2=face         0.457338     0.550659     0.208403
w1=color, w2=#000000     0.328253     0.579449     0.968415
w1=that, w2=sent         0.423327     0.404286     0.010099

In the table, we can see that the words 'FONT' and 'face' have probabilities close to 0.5, meaning that they support neither spam nor legitimate email. However, the probability of the bigram is around 0.2, which represents strong support for the legitimate class. This is due to the fact that spammers and legitimate users use different patterns of HTML code.

18 Please note that this is the Maximum Likelihood Estimator.


While the former (the spammers) write the HTML code ad hoc, the latter generate HTML messages with popular email clients like Microsoft Outlook or Mozilla Thunderbird, which always use the same (more legitimate-looking) patterns, like putting the face attribute of the font next to the FONT HTML tag. We can also see how, while 'color' and '#000000' are quite neutral on their own, the pattern 'color=#000000' (the symbol '=' is a separator) is extremely spammy. Zdziarski's experiments demonstrate noticeable decreases in the error rates, and especially in false positives, when using chained tokens in addition to the usual tokens.

Yerazunis follows a slightly more sophisticated approach in [100] and [101]. Given that spammers had already begun to fool learning-based filters by disguising spam-like expressions with intermediate symbols, he proposed to enrich the feature space with bigrams obtained by combining tokens in a sliding 5-word window over the training texts. He has called this Sparse Binary Polynomial Hash (SBPH), and it is implemented in his CRM114 Discriminator filter. In a window, all order-preserving pairs of words in which the second word is the final one of the window are built. For instance, given the phrase/window 'You can get porn free,' the following four bigrams are generated: 'You free,' 'can free,' 'get free,' 'porn free.' With these features and what the author calls the Bayesian Chain Rule (a simple application of the Bayes Theorem), impressive results have been obtained on his personal email, with the author claiming that the 99.9% accuracy plateau has been achieved.
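The window-based feature generation just described can be sketched as follows (our illustrative rendering, not CRM114's actual code); note that Zdziarski's chained tokens correspond to the adjacent-pair case:

def sbph_bigrams(words, window=5):
    # Inside a sliding window, pair each earlier word with the last word.
    features = []
    for i in range(len(words)):
        for j in range(max(0, i - window + 1), i):
            features.append((words[j], words[i]))
    return features

pairs = sbph_bigrams('You can get porn free'.split())
# The last window yields ('You','free'), ('can','free'), ('get','free'),
# ('porn','free'), matching the example in the text.
print(pairs)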

3.2.3.3 Feature Selection and Extraction. Dimensionality reduction is a required step because it improves efficiency and reduces overfitting. Many algorithms perform very poorly when they work with a large number of attributes (exceptions are k-Nearest Neighbors or Support Vector Machines), so a process to reduce the number of elements used to represent documents is needed. There are mainly two ways to accomplish this task: feature selection and feature extraction.

Feature selection tries to obtain a subset of terms with the same or even greater predictive power than the original set of terms. For selecting the best terms, we have to use a function that selects and ranks terms according to how good they are. This function measures the quality of the attributes. In the literature, terms are often selected with respect to their information gain (IG) scores [6, 79, 80], and sometimes according to ad hoc metrics [42, 71]. Information gain can be described as:

IG(X, C) = Σ_{x=0,1; c=s,l} P(X=x, C=c) · log2 [ P(X=x, C=c) / (P(X=x) · P(C=c)) ]

where s is the spam class and l the legitimate email class. Interestingly, IG is one of the best selection metrics [78]. Other quality metrics are


Mutual Information [35, 59], χ² [13, 99], Document Frequency [85, 99], or Relevancy Score [95]. Feature extraction is a technique that aims to generate an artificial set of terms, different from and smaller than the original one. Techniques used for feature extraction in Automated Text Categorization are Term Clustering and Latent Semantic Indexing. Term Clustering creates groups of terms that are semantically related; in particular, the words grouped in a cluster can be synonyms (like thesaurus classes) or just belong to the same semantic field (like 'pitcher,' 'ball,' 'homerun,' and 'baseball'). Term Clustering, as far as we know, has not been used in the context of spam filtering. Latent Semantic Indexing [27] tries to alleviate the problems produced by polysemy and synonymy when indexing documents. It compresses index vectors, creating a space with a lower dimensionality, by combining the original vectors according to patterns of terms that appear together. This algebraic technique has been applied to spam filtering by Gee and Cook with moderate success [37].
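For concreteness, the IG formula above can be computed directly from co-occurrence counts, as in this sketch (maximum likelihood estimates, no smoothing; the counts are invented):

from math import log2

def information_gain(n):
    # n[x][c]: number of messages with feature value x (0/1) and class c
    # ('s' for spam, 'l' for legitimate).
    total = sum(n[x][c] for x in (0, 1) for c in ('s', 'l'))
    ig = 0.0
    for x in (0, 1):
        for c in ('s', 'l'):
            p_xc = n[x][c] / total
            p_x = (n[x]['s'] + n[x]['l']) / total
            p_c = (n[0][c] + n[1][c]) / total
            if p_xc > 0:
                ig += p_xc * log2(p_xc / (p_x * p_c))
    return ig

counts = {0: {'s': 10, 'l': 70}, 1: {'s': 15, 'l': 5}}  # toy counts
print(information_gain(counts))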

3.2.4 Learning Algorithms

One of the most important parts of a document classification system is the learning algorithm. Given an ideally perfect classification function F that assigns each message a T/F value,19 learning algorithms have the goal of building a function F̂ that approximates F. The approximating function is usually named a classifier, and it often takes the form of a model of the data it has been trained on. The more accurate the approximation, the better the filter will perform. A wide variety of learning algorithm families have been applied to spam classification, including the probabilistic Naïve Bayes [4, 6, 42, 71, 75, 79, 80], rule learners like Ripper [34, 71, 75], Instance-Based k-Nearest Neighbors (kNN) [6, 42], Decision Trees like C4.5 [14, 34], linear Support Vector Machines (SVM) [30], classifier committees like stacking [80] and Boosting [14], and Cost-Sensitive learning [39]. By far, the most often applied learner is the probability-based classifier Naive Bayes. In the next sections, we will describe the most important learning families.

To illustrate some of the algorithms described in the following sections, we are going to use the public corpus SpamBase.20 The SpamBase collection contains 4,601 messages, 1,813 (39.4%) of them being spam. This collection has been preprocessed, and the messages are not available in raw form (to avoid privacy problems). Each message is described in terms of 57 attributes, the first 48 being

19 T is equivalent to spam; that is, spam is the 'positive' class because it is the class to be detected.
20 This collection can be accessed at: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/spambase/.


continuous real [0,100] attributes of type word_freq_WORD (the percentage of words in the email that match WORD). A word is a sequence of alphanumeric characters, and the words have been selected as the most unbalanced ones in the collection. The last 9 attributes represent the frequency of special characters like '$' and of capital letters. In order to keep the examples as simple as possible, we have omitted the nonword features and selected the five attributes with the highest Information Gain scores. We have also binarized the attributes, ignoring the percentage values. The word attributes we use are 'remove,' 'your,' 'free,' 'money,' and 'hp.' The first four correspond to spammy words, as they occur mainly in spam messages (within expressions like 'click here to remove your email from this list,' 'get free porn,' or 'save money'), and the fifth word is a clue of legitimacy, as it is the acronym of the HP corporation, in which the email message donors work.

3.2.4.1 Probabilistic Approaches. Probabilistic filters are historically the first filters and have been frequently used in recent years [46, 79]. This approach is the most used in spam filters because of its simplicity and the very good results it can achieve. This kind of classifier is based on the Bayes Theorem [61], computing the probability for a document d to belong to a category c_k as:

P(c_k|d) = P(c_k) · P(d|c_k) / P(d)

This probability can be used to make the decision about whether a document should belong to the category. In order to compute this probability, estimations about the documents in the training set are made. When computing probabilities for spam classification, we can obviate the denominator, because we have only two classes (spam and legitimate) and one document cannot be classified into more than one of them, so the denominator is the same for every k. P(c_k) can be estimated as the number of documents in the training set belonging to the category, divided by the total number of documents. Estimating P(d|c_k) is a bit more complicated, because we would need the training set to contain documents identical to the one we want to classify. When using Bayesian learners, it is very frequent to find the assumption that terms in a document are independent and that the order in which they appear in the document is irrelevant. When this happens, the learner is called a 'Naïve Bayes' learner [61, 65]. This way, the probability can be computed as:

P(d|c_k) = ∏_{i=1}^{T} P(t_i|c_k)


where T is the number of terms considered in the document representation and t_i is the i-th term (or feature) in the representation. It is obvious that this assumption about term independence does not hold in a real domain, but it helps to compute the probabilities and rarely affects accuracy [28]. That is the reason why this is the approach used in most works. The most popular version of a Naïve Bayes classifier is that by Paul Graham [46, 47]. Apart from using an ad hoc formula for computing term probabilities, it makes use of only the 15 most extreme tokens, defined as those occurring in the message whose probability is farthest from the neutral point (0.5). This approach leads to more extreme probabilities, and has been proven more effective than using the whole set of terms occurring or not in a message. So strong is the influence of this work in the literature that learning-based filters have quite often been named Bayesian Filters, despite using radically different learning algorithms.
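A sketch of this decision procedure follows (our simplification: the per-token probabilities are assumed to be already estimated from training counts, and the combination shown is the usual naive-Bayes-style one attributed to Graham):

def spam_probability(token_probs, n_extreme=15):
    # Keep the tokens whose P(spam|token) is farthest from the neutral 0.5.
    extreme = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam, prod_ham = 1.0, 1.0
    for p in extreme[:n_extreme]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# P(spam|token) for the tokens occurring in one message (toy values):
print(spam_probability([0.99, 0.95, 0.50, 0.51, 0.02]))  # > 0.5 suggests spam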

3.2.4.2 Decision Trees. One of the biggest problems of probabilistic approaches is that their results are not easy for human beings to understand, in terms of legibility. In the Machine Learning field, there are some families of learners that are symbolic algorithms, with results that are easier for people to understand. One of those is the family of Decision Tree learners. A Decision Tree is a finite tree with branches annotated with tests, and leaves being categories. Tests are usually Boolean expressions about term weights in the document. When classifying a document, we move from top to bottom in the tree, starting at the root and selecting the conditions in branches that are evaluated as true. Evaluations are repeated until a leaf is reached, assigning the document to the category that annotates the leaf. There are many algorithms used for computing the learning tree. The most important ones are ID3 [35], C4.5 [18, 56, 61], and C5 [62]. One of the simplest ways to induce a Decision Tree from a training set of already classified documents is:

1. Verify if all documents belong to the same category. Otherwise, continue.
2. Select a term t_i from the document representation and, for every feasible weight value w_ir (i.e., 0 or 1), build a branch with a Boolean test t_i = w_ir, and a node grouping all documents that satisfy the test.
3. For each document group, go to 1 and repeat the process in a recursive way.

The process ends at each node when all the documents grouped in it belong to the same category. When this happens, the node is annotated with the category name.


A critical aspect in this approach is how terms are selected. We can usually find functions that measure the quality of the terms according to how well they separate the set of documents, using Information Gain or Entropy metrics. This is basically the algorithm used by the ID3 system [76]. This algorithm has been greatly improved using better test selection techniques and tree pruning algorithms. C4.5 can be considered the state of the art in Decision Tree induction. In Fig. 5, we show a portion of the tree learned by C4.5 on our variation of the SpamBase collection. The figure shows how a message that does not contain the words 'remove,' 'money,' and 'free' is classified as legitimate (often called ham), and how a message that contains 'free' but none of 'remove,' 'money,' and 'hp' is classified as spam. The triangles represent other parts of the tree, omitted for the sake of readability.

3.2.4.3 Rule Learners. Rules of the type 'if-then' are the basis of one of the most popular concept description languages in the Machine Learning field. On one hand, they allow one to present the knowledge extracted by learning algorithms in an easy-to-understand way. On the other hand, they allow the experts to examine and validate that knowledge, and to combine it with other known facts in the domain. Rule learner algorithms build this kind of conditional rules, with a logical condition, the premise, on the left part, and the class name as the consequent, on the right part.


FIG. 5. Partial Decision Tree generated by C4.5 using the SpamBase corpus.


The premise is usually built as a Boolean expression using the weights of terms that appear in the document representation. For binary weights, conditions can be simplified to rules that check whether a certain combination of terms appears or not in the document. There are actually several techniques used to induce rules. One of the most popular ones is the algorithm proposed in [68]. It consists of:

1. Iterate, building one rule at each step, with the maximum classification accuracy over any subset of documents.
2. Delete the documents that are correctly classified by the rule generated in the previous step.
3. Repeat until the set of pending documents is empty.

As in other learners, the criteria for selecting the best terms to build rules in step 1 can be quality metrics like Entropy or Information Gain. Probably the most popular and effective rule learner is Ripper, applied to spam filtering in [34, 71, 75]. If the algorithm is applied to our variation of the SpamBase collection, we get the following four rules:

(remove = 1) => SPAM
((free = 1) and (hp = 0)) => SPAM
((hp = 0) and (money = 1)) => SPAM
() => HAM

The rules have been designed to be applied sequentially. For instance, the second rule is fired by a message that has not fired the first rule (and, in consequence, does not contain the word 'remove'), and that contains 'free' but not 'hp.' As can be seen, the fourth rule is a default one that covers all the instances not covered by the previous rules.
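Applied in code, the sequential semantics is simply a cascade of conditionals (a direct, illustrative translation of the four rules):

def classify(m):
    # m maps 'remove', 'free', 'hp', 'money' to binary weights (0 or 1).
    if m['remove'] == 1:
        return 'SPAM'
    if m['free'] == 1 and m['hp'] == 0:
        return 'SPAM'
    if m['hp'] == 0 and m['money'] == 1:
        return 'SPAM'
    return 'HAM'  # default rule

print(classify({'remove': 0, 'free': 1, 'hp': 0, 'money': 0}))  # SPAM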

3.2.4.4 Support Vector Machines. Support Vector Machines (SVM) have been introduced in Automatic Classification relatively recently [55, 56], but they have become very popular rather quickly because of the very good results obtained with these algorithms, especially in spam classification [22, 30, 39]. SVMs are an algebraic method in which maximum-margin hyperplanes are built in order to attempt to separate the training instances, using, for example, Platt's sequential minimal optimization algorithm (SMO) with polynomial kernels [72]. Training documents do not need to be linearly separable, as the method is based on the calculation of an arbitrary hyperplane for separation. However, the simplest form of hyperplane is a plane, that is, a linear function of the attributes. Fast to learn, and impressively effective in Text Categorization in general and in spam


classification in particular, SVMs represent one of the leading edges in learning-based spam filters. The linear function obtained when using the SMO algorithm on our version of the SpamBase collection is:

f(m) = (−1.9999 × w_remove) + (−0.0001 × w_your) + (−1.9992 × w_free) + (−1.9992 × w_money) + (2.0006 × w_hp) + 0.9993

This function means that, given a message m in which the weights of the words are represented by 'w_word,' being 0 or 1, the message is classified as legitimate if replacing the weights in the function leads to a positive number. So, negative factors like −1.9 (for 'remove') are spammy, and positive factors like 2.0 (for 'hp') are legitimate. Note that there is an independent term (0.9993) that makes a message without any of the considered words be classified as legitimate.

3.2.4.5 k-Nearest Neighbors. The previous learning algorithms were based on building models of the categories used for classification. An alternative approach consists of storing the training documents once they have been preprocessed and represented; when a new instance has to be classified, it is compared to the stored documents and assigned to the most appropriate category according to the similarity of the message to those in each category. This strategy does not build an explicit model of the categories, but it generates a classifier known as 'instance based,' 'memory based,' or 'lazy' [68]. The most popular one is kNN [99]. The k parameter represents the number of neighbors used for classification. This algorithm does not have a training step, and the way it works is very simple:

1. Get the k most similar documents in the training set.
2. Select the most frequent category in those k documents.

Obviously, a very important part of this algorithm is the function that computes the similarity between documents. The most common formula to obtain the distance between two documents is the 'cosine distance' [82], which is the cosine of the angle between the vectors representing the messages being compared. This formula is very effective, as it normalizes the length of the documents or messages. In Fig. 6, we show a geometric representation of the operation of kNN. The document to be classified is represented by the letter D, and instances in the positive class (spam) are represented as X, while messages in the negative class are represented as O. In the left pane, we show the case of a linearly separable space. In that case, a linear classifier and a kNN classifier would give the same outcome (spam in this case). In the right pane, we can see a more mixed space, where kNN can show its


FIG. 6. Geometric representation of the k-Nearest Neighbors (kNN) classifier. To make the figure simpler, Euclidean distance has been used instead of the 'cosine distance.'

full power by selecting the locally most popular class (legitimate), instead of the one a linear classifier would learn (spam).
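A compact sketch of the classifier follows (illustrative only; we use the cosine similarity from the text, over toy binary term vectors):

from collections import Counter
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def knn_classify(doc, training, k=3):
    # training: list of (vector, label) pairs; vectors are term weights.
    neighbors = sorted(training, key=lambda t: cosine(doc, t[0]), reverse=True)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

training = [([1, 1, 0], 'spam'), ([1, 0, 0], 'spam'), ([0, 0, 1], 'legitimate')]
print(knn_classify([1, 1, 1], training, k=3))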

3.2.4.6 Classifier Committees. Another approach consists of applying different models to the same data, combining them to get better results. Bagging, boosting, and stacking are some of the techniques used to combine different learners. The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) consists of combining the predicted classifications from multiple models, or from the same type of model trained on different learning data. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (Machine Learning) algorithm for generating the weights for weighted prediction or voting is the boosting procedure. The concept of boosting (applied to spam detection in [14]) is used to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification. A simple algorithm for boosting works like this: start by applying some method to the learning data, where each observation is assigned an equal weight; compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification; in other words, assign greater weight to the observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low); then apply the classifier again to the weighted data (or with different misclassification costs), and


continue with the next iteration (application of the analysis method for classification to the re-weighted data). If we apply boosting to the C4.5 learner with 10 iterations, we obtain 10 decision trees with weights, which are applied to an incoming message. The first tree is that of Fig. 5, with weight 1.88, and the last tree has only a test on the word 'hp': if it does not occur in the message, the message is classified as spam, and as legitimate otherwise. The weight of this last tree is only 0.05. The concept of stacking (short for Stacked Generalization) [80] is used to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method [97]. In stacking, the predictions from different classifiers are used as input to a meta-learner, which attempts to combine them to create a final best predicted classification.
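The re-weighting loop described above can be sketched as follows (an AdaBoost-style rendering under our own simplifications; 'learn' stands for any weight-aware learner, such as a C4.5-like tree inducer):

from math import log

def boost(examples, labels, learn, rounds=10):
    weights = [1.0 / len(examples)] * len(examples)
    committee = []  # (classifier, weight) pairs for weighted voting
    for _ in range(rounds):
        clf = learn(examples, labels, weights)
        wrong = [i for i, x in enumerate(examples) if clf(x) != labels[i]]
        error = sum(weights[i] for i in wrong)
        if error == 0.0 or error >= 0.5:
            break
        committee.append((clf, 0.5 * log((1 - error) / error)))
        for i in wrong:  # emphasize the hard-to-classify observations
            weights[i] *= (1 - error) / error
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee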

3.2.4.7 Cost-Sensitive Learning. When talking about spam filtering, we have to take into account that the costs of misclassification are not balanced in real life, as the penalty of a false positive (a legitimate message classified as spam) is much higher than that of a false negative (a spam message classified as legitimate). This is due to the risk of missing important valid messages (like those from the user's boss!), because messages considered spam can be immediately purged or, in a more conservative scenario, kept in a quarantine that the user rarely screens. The algorithms commented on above assume balanced misclassification costs by default, and it is wise to use techniques to make those algorithms cost-sensitive, in order to build more realistic filters [39].

Thresholding is one of the methods used for making algorithms cost-sensitive. Once a numeric-prediction classifier has been produced using a set of pre-classified instances (the training set), one can compute a numeric threshold that optimizes the cost on another set of pre-classified instances (the validation set). When new instances are to be classified, the numeric threshold determines if each instance is classified as positive (spam) or negative (legitimate). The cost is computed in terms of a cost matrix that typically assigns 0 cost to the hits, a positive cost to false negatives, and a much bigger cost to false positives. This way, instead of optimizing the error or the accuracy, the classifier optimizes the cost.

The weighting method consists of re-weighting the training instances according to the total cost assigned to each class. This method is equivalent to stratification by oversampling as described in [29]. The main idea is to replicate instances of the most costly class, to force the Machine Learning algorithm to correctly classify that class


instances. Another effective cost-sensitive meta-learner is the MetaCost method [29], based on building an ensemble of classifiers using the bagging method, relabeling the training instances according to the cost distributions and the ensemble outcomes, and finally training a classifier on the modified training collection. In [39], experiments with a number of Machine Learning algorithms and the three previous cost-sensitive schemas have shown that the combination of weighting and SVM is the most effective one.
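As an illustration of thresholding, the following sketch picks the score threshold that minimizes the total cost on a validation set (the 1:999 cost ratio is just an example, echoing the discussion of user-dependent proportions above):

COST_FP, COST_FN = 999.0, 1.0  # false positives far costlier; hits cost 0

def best_threshold(scores, labels):
    # scores: classifier spam scores; labels: True for spam, False for ham.
    best_t, best_cost = None, float('inf')
    for t in sorted(set(scores)):
        fp = sum(1 for s, spam in zip(scores, labels) if s >= t and not spam)
        fn = sum(1 for s, spam in zip(scores, labels) if s < t and spam)
        cost = COST_FP * fp + COST_FN * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

print(best_threshold([0.1, 0.4, 0.6, 0.9], [False, False, True, True]))  # 0.6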

3.3 Filtering by Compression

Compression has recently emerged as a new paradigm for Text Classification in general [90], and for spam filtering in particular [11]. Compression demonstrates high performance in Text Categorization problems in which classification depends on nonword features of a document, such as punctuation, word stems, and features spanning more than one word, as in dialect identification and authorship attribution. In the case of spam filtering, compression-based methods have emerged as the top performers in competitive evaluations like TREC [19]. An important problem of Machine Learning algorithms is the dependence of the results on their parameter settings. In simple words, a big number of parameters can make it hard to find the optimal combination of them, that is, the one that leads to the most general and effective patterns. Keogh et al. discuss the need for a parameter-free algorithm:

Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm prevents us from imposing our prejudices and presumptions on the problem at hand, and let the data itself speak to us. [57] (p. 206)

Keogh presents data compression as a Data Mining paradigm that realizes this vision. Data compression can be used as an effective Machine Learning algorithm, especially on text classification tasks. The basic idea of using compression in a text classification task is to assign a text item to the class that best compresses it. This can be straightforwardly achieved by using any state-of-the-art compression algorithm and a command-line process. As a result, the classification algorithm (the compression algorithm plus a decision rule) is easy to code, greatly efficient, and it does not need any preprocessing of the input texts. In other words, there is no need to represent the text as a feature vector, avoiding one of the most difficult and challenging tasks, that is, text representation. As a side effect, this makes it especially hard to reverse engineer the classification process, leading to more effective and stronger spam detection systems.


The rule of classifying a message into the class that best compresses it is a straightforward application of the Minimum Description Length (MDL) Principle [11], which favors the most compact (short) explanation of the data. The recent works by Sculley and Brodley [84] and by Bratko et al. [11] formalize this intuition in different ways; we will partially follow the work by Sculley and Brodley. Let C(.) be a compression algorithm, that is, a function that transforms strings into (shorter) strings (this presentation can also be made in terms of sequences of bits instead of strings as sequences of characters). A compression algorithm usually generates a (possibly implicit) model. Let C(X|Y) also be the compression of the string Y using the model generated by compressing the string X. We denote by |S| the length of a string S, typically measured as a number of bits, and by XY the string Y appended to the string X. The MDL principle states that, given a class A of text instances, a new text X should be assigned to A if A compresses X better than the complement class A^c does. If we interpret the class A as a sequence of texts (and so is A^c), the decision rule may be:

  class(X) = arg min_{c ∈ {A, A^c}} |C(c|X)|

This formula is one possible decision rule for transforming a compression algorithm into a classifier. The decision rule is based on the 'approximate' distance |C(A|X)| (not a distance in the proper sense, as it does not satisfy all the formal requirements of a distance). Sculley and Brodley [84] review a number of metrics and measures that are beyond the scope of this presentation. The length of the text X compressed with the model obtained from a class A can be approximated by compressing AX:

  |C(A|X)| ≈ |C(AX)| − |C(A)|

This way, any standard compressor can be used to predict the class of a text X, given the classes A and A^c. Following [11], there are two basic kinds of compressors: two-part and adaptive coders. The first class of compressors first trains a model over the data to encode, and then encodes the data, requiring two passes over it. These kinds of encoders append the data to the model, and the decoder reads the model and then decodes the data. The most classical example of this kind of compressor is a double-pass Huffman coder, which accumulates the statistics of the observed symbols, builds a statistically optimal tree using a greedy algorithm, and writes a file with the tree and the encoded data.
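As a simple illustration, the decision rule above can be sketched in a few lines of Python using an off-the-shelf compressor (zlib here; the systems described in this section use DMC or PPM instead, and zlib's limited window makes this only a toy approximation):

    import zlib

    def clen(text: str) -> int:
        """Length, in bytes, of the compressed text."""
        return len(zlib.compress(text.encode("utf-8"), 9))

    def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
        """MDL rule: assign the message to the class that compresses it best,
        approximating |C(A|X)| by |C(AX)| - |C(A)|."""
        spam_cost = clen(spam_corpus + message) - clen(spam_corpus)
        ham_cost = clen(ham_corpus + message) - clen(ham_corpus)
        return "spam" if spam_cost < ham_cost else "legitimate"

    # spam_corpus and ham_corpus would be the concatenated training messages of each class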


Adaptive compressors instead start with an empty model (e.g., a uniform distribution over all the symbols) and update it as they encode the data. The decoder repeats the process, building its own version of the model as the decoding progresses. Adaptive coders require a single pass over the data, so they are more efficient in terms of time. They have also reached the quality of two-part compressors in terms of compression ratio and, more interestingly for our purposes, they turn the previous approximation into an equality. Examples of adaptive compressors include all of those used in our work, which we describe below. For instance, the Dynamic Markov Compression (DMC), the Prediction by Partial Matching (PPM), and the family of Lempel-Ziv (LZ) algorithms are all adaptive methods (see [11] and [84]). Let us get back to the parameter issue. Most compression algorithms do not require any preprocessing of the input data. This clearly avoids the steps of feature representation and selection usually taken when building a learning-based classifier. Also, compressors most often have a relatively small number of parameters, approaching the vision of a parameter-free or parameter-light Data Mining. However, a detailed analysis performed by Sculley and Brodley in [84] brings light to this point, as results may depend on the explicit parameters of the compression algorithms, the notion of distance used, and the implicit feature space defined by each algorithm. On the other hand, it is extremely easy to build a compression-based classifier by using the rules above and a standard off-the-shelf compressor; such classifiers have proven effective on a number of tasks, and top performers in spam filtering. In the words of Cormack (one of the organizers of the reputed TREC spam Track competition) and Bratko: 'At TREC 2005, arguably the best-performing system was based on adaptive data compression methods,' and 'one should not conclude, for example, that SVM and LR (Logistic Regression) are inferior for on-line filtering. One may conclude, on the other hand, that DMC and PPM set a new standard to beat on the most realistic corpus and test available at this time.' [21]

While other compression algorithms have been used in text classification, it appears that only DMC and PPM have been used in spam filtering. The Dynamic Markov Compression algorithm [11] models an information source with a finite state machine (FSM). It constructs two variable-order Markov models (one for spam and one for legitimate email), and classifies a message according to which of the models predicts it best. The Prediction by Partial Matching algorithm is a back-off smoothing technique for finite-order Markov models, similar to the back-off models used in natural language processing, and has set the standard for lossless text compression since its introduction over two decades ago, according to [17]. The best of both is DMC, but PPM is better known and very competitive with it (and both are superior to Machine Learning approaches at spam filtering). There are some other efforts that also work at the character level (in combination with some ad hoc scoring function), but they differ from compression, which has been principally designed to model sequential data. In particular, IBM's Chung-Kwei system [77] uses pattern matching techniques originally developed for DNA sequences, and Pampapathi et al. [70] have proposed a filtering technique based on the suffix tree data structure. This reflects the trend of minimizing tokenization complexity in order to avoid one of the most relevant vulnerabilities of learning-based filters.

3.4 Comparison and Summary

Heuristic filtering relying on the manual effort of communities of experts has been quickly beaten by spammers, as they have stronger (economic) motivation and more time. But with automated content-based methods, a new episode in the war against spam is being written. The impact and success of content-based spam filtering using Machine Learning and compression is significant, as spammers have invested much effort in avoiding this kind of filtering, as we review below. The main strength of Machine Learning is that it adapts the filter to the actual user's email. As every user is unique, every filter is different, and it is quite hard to avoid all vendors' and users' filters simultaneously. The processing structure of learning-based filters includes a tokenization step in which the system identifies the individual features (typically character strings, often meaningful words) on which it relies to learn and classify. This step has been recognized as the major vulnerability of learning-based filtering, and it is the subject of most spammers' attacks, in the form of tokenization attacks and image spam. The answer from the research community has been fast and effective, focusing on character-level modeling techniques based on compression, which detect implicit patterns that not even spammers know they are using! Compression-based methods have proven top effective in competitive evaluations and, in consequence, they can be presented as the current (successful) trend in content-based spam filtering.

4. Spam Filters Evaluation

Classifier system evaluation is a critical point: no progress may be expected if there is no way to assess it. Standard metrics, collections, and procedures are required, which allow the cross-comparison of research works. The Machine Learning community has well-established evaluation methods, but these have been adapted and improved when facing what is probably the main difficulty in spam filtering: the problem of asymmetric misclassification costs. The fact that a false positive (a legitimate message classified as spam) is much more damaging than a false negative (a spam classified as legitimate) implies that evaluation metrics must attribute bigger weights to worse errors, but . . . which weights? We analyze this point in Section 4.3.

Scientific evaluations have well-defined procedures, consolidated metrics, and public test collections that make the cross-comparison of results relatively easy. The basis of scientific experiments is that they have to be reproducible. There are also a number of industrial evaluations, performed by the filter vendors themselves and typically presented in their white papers and, more interestingly, those performed by specialized computer magazines. For instance, in [53], the antispam appliances BorderWare MXtreme MX-200 and Proofpoint P800 Message Protection Appliance are compared using an ad hoc list of criteria, including Manageability, Performance, Ease of Use, Setup, and Value. In [3], 10 systems are tested according to yet another set of criteria, allowing no possible comparison. The limitations of industrial evaluations include self-defined criteria, private test collections, and self-defined performance metrics. In consequence, we focus on scientific evaluations in this chapter.

The main issues in the evaluation of spam filters are the test collections, the running procedure, and the evaluation metrics. We discuss these issues in the next subsections, with special attention to what we consider a real and accurate standard in current spam filter evaluation, the TREC spam Track competition.

4.1 Test Collections

A test collection is a set of manually classified messages that are sent to a classifier in order to measure its effectiveness, in terms of hits and mistakes. It is important that test collections are publicly available, because this allows the comparison of approaches and the improvement of the technology. On the other hand, message collections may include private email, so privacy protection is an issue. There are a number of works in which the test collections employed are kept private, as they are personal emails of the authors or have been donated to them under the condition of being kept private, as in early works [30], [75], and [79], or even in more recent works like [36], [47], and [69]. The privacy problem can be solved in different ways:

• Serving a processed version of the messages that does not allow rebuilding them. This approach has been followed in the SpamBase, PU1, and 2006 ECML-PKDD Discovery Challenge public collections.
• Building the collection using only messages from public sources. This is the approach in the Lingspam and SpamAssassin Public Corpus collections.
• Keeping the collection private in the hands of a reputable institution that performs the testing on behalf of the researchers. The TREC spam Track competition follows this approach, making some test collections public and keeping some others private.

In the next paragraphs, we present the test collections that have been made publicly available, solving the privacy problem:

• SpamBase (described above; some information is repeated for the sake of comparison) is an email message collection containing 4,601 messages, 1,813 (39%) of them marked as spam. The collection comes in preprocessed (not raw) form, and its instances have been represented as 58-dimensional vectors. The first 48 features are words extracted from the original messages, without stop list or stemming, and selected as the most unbalanced words for the UCE class. The next 6 features are the percentage of occurrences of the special characters ';', '(', '[', '!', '$', and '#'. The following 3 features represent different measures of occurrences of capital letters in the text of the messages. Finally, the last feature is the class label. This collection has been used in, for example, [34], and its main problem is that it is preprocessed; in consequence, it is not possible to define and test other features apart from those already included.
• The PU1 corpus (available at http://www.aueb.gr/users/ion/data/PU123ACorpora.tar.gz), presented in [6], consists of 1,099 messages, 481 (43%) of them spam and 618 legitimate. Received by Ion Androutsopoulos, it has been processed by removing attachments and HTML tags. To respect privacy, in the publicly available version of PU1, fields other than 'Subject:' have been removed, and each token (word, number, punctuation symbol, etc.) in the bodies or subjects of the messages was replaced by a unique number, the same number throughout all the messages. This hashing 'encryption' mechanism makes it impossible to perform experiments with tokenization techniques and features other than those included by the authors. It has been used in, for example, [14] and [52].
• The ECML-PKDD 2006 Discovery Challenge collection (available at http://www.ecmlpkdd2006.org/challenge.html) was collected by the challenge organizers in order to test how to improve spam classification using untagged data [9]. It is available in a processed form: strings that occur fewer than four times in the corpus are eliminated, and each message is represented by a vector indicating the number of occurrences of each feature in the message. The same comments made for PU1 apply to this collection.


• The Lingspam test collection (available at http://www.aueb.gr/users/ion/data/lingspam_public.tar.gz), presented in [4] and used in many studies (including [14], [39], and very recent ones like [21]), has been built by mixing spam messages with messages extracted from spam-free public archives of mailing lists. In particular, the legitimate messages have been extracted from the Linguist list, a moderated (hence, spam-free) list about the profession and science of linguistics. The number of legitimate messages is 2,412, and the number of spam messages is 481 (16%). The Linguist messages are, of course, more topic-specific than most users' incoming email. They are less standardized and, for instance, they contain job postings, software availability announcements, and even flame-like responses. In consequence, the conclusions obtained from experiments performed on it are limited.
• The SpamAssassin Public Corpus (available at http://spamassassin.apache.org/publiccorpus/) has been collected by Justin Mason (a SpamAssassin developer) with the public contributions of many others, and consists of 6,047 messages, 1,897 (31%) of them spam. The legitimate messages (named 'ham' in this collection) have been further divided into easy and hard (the ones that make use of rich HTML, colored text, spam-like words, etc.). As it is relatively big, realistic, and public, it has become the standard in spam filter evaluation. Several works make use of it (including [11], [21], [22], and [67]), and it has been routinely used as a benchmark in the TREC spam Track [19].

Apart from these collections, the TREC spam Track features several public and private test collections, like the TREC Public Corpus (trec05p-1) and the Mr. X, S.B., and T.M. Private Corpora. For instance, the S.B. corpus consists of 7,006 messages (89% ham, 11% spam) received by an individual in 2005. The majority of the ham messages stem from four mailing lists (23%, 10%, 9%, and 6% of all ham messages) and private messages received from three frequent correspondents (7%, 3%, and 2%, respectively), while the vast majority of the spam messages (80%) are traditional spam: viruses, phishing, pornography, and Viagra ads. Most TREC collections have two very singular and interesting properties:

1. Messages are chronologically sorted, allowing one to test the effect of incremental learning, which we discuss below as online testing.
2. They have been built using an incremental procedure [20], in which messages are tagged using several antispam filters, and the classification is reviewed by hand when a filter disagrees.

TREC collections are also very big in comparison with the previously described ones, letting researchers arrive at more trustworthy conclusions.



In other words, TREC spam Track has set the evaluation standard for antispam learning-based filters.

4.2 Running Test Procedure

The most frequent evaluation procedure for a spam filter is batch evaluation. The spam filter is trained on a set of messages and applied to a different set of test messages. The test messages are labeled, so it is possible to compare the judgments of the filter and of the expert, computing hits and mistakes. It is essential that the test collection is similar to, but disjoint from, the training one, and that it reflects operational settings as closely as possible. This is the test procedure employed in most evaluations using the SpamBase, PU1, and Lingspam test collections.

A refinement of this evaluation is to perform N-fold cross-validation. Instead of using a separate test collection, portions of the labeled collection are used both for training and for testing. In short, the labeled collection is randomly divided into N sets or folds (preserving the class distribution), and N tests are run, using N−1 folds as the training set and the remaining one as the test set. The results are averaged over the N runs, leading to more statistically valid figures, as the experiment does not depend on unpredictable features of the data (all of them are used for training and testing). This procedure has sometimes been followed in spam filtering evaluation, as in [39].

A major criticism of this approach is that the usual operation of spam filters allows them to learn from the mistakes they make (if the user reports them, of course). Batch evaluation does not allow the filters to learn as they classify, and ignores chronological ordering if available. The TREC organizers have instead approached filter testing as an online learning task in which messages are presented to the filter, one at a time, in chronological order. For each message, the filter predicts its class (spam or legitimate) by computing a score S which is compared to a fixed but arbitrary threshold T. Immediately after the prediction, the true class is presented to the filter, so that it might use this information in future predictions. This evaluation procedure is supported with a specific set of scripts, requiring a filter to implement the following command-line functions:

• Initialize – creates any files or servers necessary for the operation of the filter.
• Classify message – returns the ham/spam classification and a spamminess score for a message.
• Train ham message – informs the filter of the correct (ham or legitimate) classification of a previously classified message.
• Train spam message – informs the filter of the correct (spam) classification of a previously classified message.
• Finalize – removes any files or servers created by the filter.
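A filter wrapped for this kind of online evaluation reduces to a small command-line program. The following Python skeleton is a minimal sketch of that interface; the word-counting 'model' is a placeholder rather than a real filter, and the command names and model file name only mimic the functions listed above:

    import os, pickle, sys

    MODEL = "filter.model"  # assumed model file name

    def load():
        return pickle.load(open(MODEL, "rb")) if os.path.exists(MODEL) \
            else {"spam": {}, "ham": {}}

    def train(label, path):
        model = load()
        for word in open(path).read().split():
            model[label][word] = model[label].get(word, 0) + 1
        pickle.dump(model, open(MODEL, "wb"))

    if __name__ == "__main__":
        command = sys.argv[1]
        if command == "initialize":
            pickle.dump({"spam": {}, "ham": {}}, open(MODEL, "wb"))
        elif command == "classify":
            model = load()
            words = open(sys.argv[2]).read().split()
            score = sum(model["spam"].get(w, 0) - model["ham"].get(w, 0) for w in words)
            print("spam" if score > 0 else "ham", score)
        elif command == "train_spam":
            train("spam", sys.argv[2])
        elif command == "train_ham":
            train("ham", sys.argv[2])
        elif command == "finalize" and os.path.exists(MODEL):
            os.remove(MODEL)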


This is the standard procedure used in TREC. An open question is whether both methods are equivalent in terms of the results obtained. Cormack and Bratko have made in [21] a systematic comparison of a number of top-performing algorithms following both procedures, arriving at the conclusion that the current leaders are compression methods, and that the online procedure is more suitable because it is closer to operational settings.

4.3 Evaluation Metrics

We have divided the metrics used in spam filtering studies into three groups: the basic metrics employed in the initial works, the ROCCH method as a more advanced one that addresses the shortcomings of those basic metrics, and the TREC metrics as the current standard.

4.3.1 Basic Metrics

The effectiveness of spam filtering systems is measured in terms of the number of correct and incorrect decisions. Let us suppose that the filter classifies a given number of messages. We can summarize the relationship between the system classifications and the correct judgments in a confusion matrix, like that shown in Table II. Each entry in the table specifies the number of documents with the specified outcome. For the problem of filtering spam, we take spam as the positive class (+) and legitimate as the negative class (−). In this table, 'tp' means 'number of true-positive decisions,' and 'tn,' 'fp,' and 'fn' refer to the number of 'true-negative,' 'false-positive,' and 'false-negative' decisions, respectively.

Table II. A set of N classification decisions represented as a confusion matrix

                 Classified +   Classified −
    Actual +          tp             fn
    Actual −          fp             tn

Most traditional TC evaluation metrics can be defined in terms of the entries of the confusion matrix. F1 [85] is a measure that gives equal importance to recall and precision. Recall is defined as the proportion of class members assigned to a category by a classifier. Precision is defined as the proportion of correctly assigned documents to a category. Given a confusion matrix like the one shown in the table, recall (R), precision (P), and F1 are computed using the following formulas:

  R = tp / (tp + fn)
  P = tp / (tp + fp)
  F1 = 2RP / (R + P)

Recall and precision metrics have been used in some of the works in spam filtering (e.g., [4–6, 42, 80]). Other works make use of standard ML metrics, like accuracy and error [71, 75]. Recall that not all kinds of classification mistakes have the same importance for a final user. Intuitively, the error of classifying a legitimate message as spam (a false positive) is far more dangerous than classifying a spam message as legitimate (a false negative). This observation can be re-expressed as: the cost of a false positive is greater than the cost of a false negative in the context of spam classification. Misclassification costs are usually represented as a cost matrix in which the entry C(A,B) means the cost of taking decision A when the correct decision is B, that is, the cost of A given B (cost(A|B)). For instance, C(+,−) is the cost of a false-positive decision (classifying legitimate email as spam) and C(−,+) is the cost of a false-negative decision. The situation of unequal misclassification costs has been observed in many other ML domains, like fraud and oil spill detection [74]. The metric used for evaluating classification systems must reflect the asymmetry of misclassification costs. In the area of spam filtering, several cost-sensitive metrics have been defined, including weighted accuracy (WA), weighted error (WE), and total cost ratio (TCR) (see, e.g., [5]). Given a cost matrix, the cost ratio (CR) is defined as the cost of a false positive over the cost of a false negative. Given the confusion matrix for a classifier, the WA, WE, and TCR of the classifier are defined as:

  WA = (CR · tn + tp) / (CR · (tn + fp) + (tp + fn))
  WE = (CR · fp + fn) / (CR · (tn + fp) + (tp + fn))
  TCR = (tp + fn) / (CR · fp + fn)
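The following minimal Python function, given only as an illustration, computes all the metrics defined so far from the entries of the confusion matrix (the example counts are made up):

    def metrics(tp, fp, fn, tn, cr=9.0):
        """Recall, precision, F1, and the cost-sensitive WA, WE, and TCR;
        cr is the cost ratio (cost of a false positive over a false negative)."""
        r = tp / (tp + fn)
        p = tp / (tp + fp)
        f1 = 2 * r * p / (r + p)
        denominator = cr * (tn + fp) + (tp + fn)
        wa = (cr * tn + tp) / denominator
        we = (cr * fp + fn) / denominator
        tcr = (tp + fn) / (cr * fp + fn)  # baseline: the trivial rejecter
        return {"R": r, "P": p, "F1": f1, "WA": wa, "WE": we, "TCR": tcr}

    print(metrics(tp=380, fp=5, fn=20, tn=595))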


The WA and WE metrics are versions of the standard accuracy and error measures that penalize the less preferred mistakes. Taking as a baseline the trivial rejecter that classifies every message as legitimate (equivalent to not using a filter), the TCR of a classifier represents to what extent the classifier is better than that baseline. These metrics are less standard than others used in cost-sensitive classification, such as expected cost, but to some extent they are equivalent. These metrics have been calculated for a variety of classifiers, in three scenarios corresponding to three CR values (1, 9, and 999) [4–6, 42, 80]. The main problem presented in the literature on cost-sensitive spam categorization is that the CR values used do not correspond to real-world conditions, which are unknown and may be highly variable. There is no evidence that a false positive is 9 or 999 times worse than the opposite mistake. Like class distributions, CR values may vary from user to user, from corporation to corporation, and from ISP to ISP. The evaluation methodology must take this fact into account. Fortunately, there are methods that allow evaluating classifier effectiveness when target (class distribution and CR) conditions are not known, as in spam filtering. In the next subsection, we introduce the ROCCH method for spam filtering.

4.3.2 The ROCCH Method

The receiver operating characteristics (ROC) analysis is a method for evaluating and comparing a classifier's performance. It has been extensively used in signal detection, and introduced and extended in the Machine Learning community by Provost and Fawcett (see, e.g., [74]). In ROC analysis, instead of a single value of accuracy, a pair of values is recorded for each of the different class and cost conditions under which a classifier is learned. The values recorded are the false-positive (FP) rate and the true-positive (TP) rate, defined in terms of the confusion matrix as:

  FP = fp / (fp + tn)
  TP = tp / (tp + fn)

The TP rate is equivalent to the recall of the positive class, while the FP rate is equivalent to 1 minus the recall of the negative class. Each (FP,TP) pair is plotted as a point in the ROC space. Most ML algorithms produce different classifiers under different class and cost conditions. For these algorithms, the conditions are varied to obtain a ROC curve. We will discuss how to get ROC curves by using methods for making ML algorithms cost-sensitive.


One point on a ROC diagram dominates another if it is above and to the left, that is, it has a higher TP and a lower FP. Dominance implies superior performance for a variety of common performance measures, including expected cost (and hence WA and WE), recall, and others. Given a set of ROC curves for several ML algorithms, the one which is closest to the upper left corner of the ROC space represents the best algorithm. Dominance is rarely achieved when comparing ROC curves. Instead, it is possible to compute a range of conditions in which one ML algorithm will produce better results than the other algorithms. This is done through the ROC convex hull (ROCCH) method, first presented in [74]. Concisely, any point of a set of (FP,TP) points that does not lie on the upper convex hull of the set corresponds to a suboptimal classifier for all class and cost conditions. In consequence, given a ROC curve, only its upper convex hull can be optimal, and the rest of its points can be discarded. Also, for a set of ROC curves, only the fraction of each one that lies on their joint upper convex hull is retained, leading to a slope range in which the ML algorithm corresponding to the curve produces the best-performing classifiers. An example of ROC curves, taken from [39], is presented in Fig. 7. As can be seen, there is no single dominator. ROC analysis allows a visual comparison of the performance of a set of ML algorithms, regardless of the class and cost conditions. This way, the decision of which is the best classifier or ML algorithm can be delayed until the target (real-world) conditions are known, and valuable information can be obtained at the same time. In the most advantageous case, one algorithm is dominant over the entire slope range.

FIG. 7. A ROC curve example (curves for C4.5, SVM, NB, Rocchio, and PART, together with their convex hull, CH).


Usually, several ML algorithms will lead to classifiers that are optimal (among those tested) for different slope ranges, corresponding to different class and cost conditions. Operatively, the ROCCH method consists of the following steps:

1. For each ML algorithm, obtain a ROC curve and plot it (or only its convex hull) on the ROC space.
2. Find the convex hull of the set of ROC curves previously plotted.
3. Find the range of slopes for which each ROC curve lies on the convex hull.
4. If the target conditions are known, compute the corresponding slope value and output the best algorithm. Otherwise, output all ranges and the locally best algorithms or classifiers.

We made use of the ROCCH method for evaluating a variety of ML algorithms on the problem of spam filtering in [39]. That was the very first time ROC curves were used in spam filtering testing, but they have since become a standard in TREC evaluations.
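Step 2 is a standard computational-geometry operation. As an illustration, a minimal Python sketch of the upper convex hull of a set of (FP,TP) points, using a monotone-chain scan and assuming the points include (0,0) and (1,1), could be:

    def upper_hull(points):
        """Upper convex hull of (FP, TP) points; points below it are suboptimal."""
        pts = sorted(set(points))  # sort by FP, then TP
        hull = []
        for p in pts:
            # drop the last point while it lies on or below the chord hull[-2] -> p
            while len(hull) >= 2:
                (x1, y1), (x2, y2) = hull[-2], hull[-1]
                if (x2 - x1) * (p[1] - y1) >= (p[0] - x1) * (y2 - y1):
                    hull.pop()
                else:
                    break
            hull.append(p)
        return hull

    roc = [(0.0, 0.0), (0.05, 0.6), (0.1, 0.7), (0.2, 0.85), (0.3, 0.8), (1.0, 1.0)]
    print(upper_hull(roc))  # (0.3, 0.8) is discarded as suboptimal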

4.3.3 TREC Metrics

The TREC Spam Track metrics [19] are also considered a standard in spam filtering evaluation. Their main improvement over the ROC method discussed above is their adaptation to the online evaluation procedure. As online evaluation allows a filter to learn from immediately reported errors, its (TP,FP) rates are always changing, and the values improve with time. The ROC graph is transformed into a single numeric figure by computing the Area Under the ROC Curve (AUC). As the evaluated filters are extremely effective, it is better to report the complementary (1−AUC) value, which is computed over time as the online learning-testing process advances. This way, it is possible to get an idea of how quickly the systems learn, and at which error levels their performance reaches a plateau. A ROC Learning Curve graph, taken from [19] for the Mr. X test collection, is shown in Fig. 8. It is easy to see how filters commit a relatively high number of mistakes at the beginning of the ROC Learning Curve, and how they improve their results until finally achieving a stable performance level.

FIG. 8. A ROC learning curve from TREC, plotting (1-ROCA)% on a logit scale against the number of messages processed, for the systems 621SPAM1mrx, kidSPAM1mrx, tamSPAM1mrx, lbSPAM2mrx, ijSPAM2mrx, yorSPAM2mrx, and crmSPAM2mrx.

5. Spam Filters in Practice

Implementing effective and efficient spam filters is a nontrivial task. Depending on concrete needs, a filter should be implemented in a certain way, and the management of spam messages can vary depending on the daily amount of messages, the estimated impact of possible false positives, and other peculiarities of users and companies.


Figure 9 shows a diagram of the mail transport process. Spam filters can be implemented at any location in this process, although the most common and logical ones are the receiver's SMTP server/Mail Delivery Agent (server side) and the receiver's email client (client side). There are also some enterprise solutions that perform the spam filtering in external servers, which is not reflected in this diagram as it is not yet a usual setup. There are important questions that we cannot ignore and must answer carefully before implementing or deploying a spam filter solution. Should we implement our spam filter on the client side or on the server side? Should we use any kind of collaborative filtering? Should the server delete the spam messages, or should the user decide what to do with those undesirable messages?

5.1 Server Side Versus Client Side Filtering

The first important choice when implementing or integrating a spam filter in an IT infrastructure is the location of the filter: on the client side or on the server side. Each location has its own benefits and disadvantages, and those must be weighed to choose the right location depending on the concrete needs of the user or corporation.


[Sender's email client → sender's SMTP server → receiver's SMTP server / mail delivery agent (server side filtering) → receiver's POP server → receiver's email client (client side filtering)]

FIG. 9. Email transport process and typical location of server and client side filters. Filters can, virtually, be implemented in any step of the process.

The main features of server side spam filters are:

• They reduce the network load, as the server does not deliver the mails categorized as spam, which can be the biggest part of the received messages.
• They also reduce the computational load on the client side, as the mail client does not need to check each received message. This can be very helpful when most clients have a low computational capacity (handhelds, mobile phones, etc.).
• They allow certain kinds of collaborative filtering, or a better integration of different antispam techniques. For example, when spam messages received in different users' accounts come from the same IP, that IP can be added to a black list, preventing other users from receiving messages from the same spammer.

On the other side, client side spam filters:

• Allow a more personalized detection and management of the messages.
• Integrate the knowledge from several email accounts belonging to the same user, preventing the user from receiving the same spam in different email accounts.
• Reduce the need for dedicated mail servers, as the computation is distributed among all the users' computers.


As corporations and other organizations grow, their communication needs also grow, and they increasingly rely on dedicated servers and appliances that are responsible for email services, including the email server and the full email security suite, with the antivirus and antispam solutions. Appliances and dedicated servers are the choice for medium to big corporations, while small organizations and individual users can group all Internet services in one machine. In big corporations, each Internet-related service has its own machine, as the computational requirements deserve special equipment, and it is a good idea to distribute the services in order to minimize risks.

5.2 Quarantines

Using statistical techniques, it is really simple to detect almost all received spam, but it is really difficult, almost impossible, to reach a zero false-positive ratio. As common bureaucratic processes and communications are delegated to email, false positives become a really annoying issue; losing only one important mail can have serious economic consequences or even delay some important process. On the other hand, as each user is different, what should be done with a spam message should be left to the user's choice. A watch-loving user might want to keep Rolex-related spam; a key point here is that spam is sent because the messages are of interest to a certain set of users. These points are difficult to address with a server side spam filter solution, as such a filter is not as personalized as one implemented on the client side. Quarantine is a strategy developed to face these points. The messages detected as spam by the filter are stored in the server for a short period of time, and the server mails a Quarantine Digest to the user reporting all the messages under quarantine. The user is given the choice of preserving those messages or deleting them. Quarantine is a helpful technique that allows one to:

• Reduce the disk space and resources that spam is using on the mail servers.
• Reduce the users' level of frustration when they receive spam.
• Keep spam from getting into mailing lists.
• Prevent auto replies (vacation, out of office, etc.) from going back to the spammers.

In Fig. 10, we show an example of a quarantine in the Trend Micro InterScan Messaging Security Suite, a solution that includes antivirus and antispam features and is designed for server side filtering. The quarantine is accessed by actual users through a web application, where they log in and screen the messages that the filter has considered spam. This quarantine in particular features access to a white list of approved senders, in which the user can type patterns that make the filter ignore messages, like those coming from the user's organization.

FIG. 10. An example of a list of messages in the Trend Micro InterScan Messaging Security Suite.

5.3 Proxying and Tagging

There are a number of email clients that currently implement their own content-based filters, most often based on Graham's Bayesian algorithm. For instance, the popular open-source Mozilla Thunderbird client includes a filter that is able to learn from the user's actual email, leading to better personalization and increased effectiveness. However, there are a number of email clients that do not feature an antispam filter at all. Although there are a number of extensions or plugins for popular email clients (like the ThunderBayes extension for Thunderbird, or the SpamBayes Microsoft Outlook plugin, both integrating the SpamBayes filter controls into the email clients), the user may wish to keep the antispam software out of the email client, in order to be able to change either of them if a better product is found.

There are a number of spam filters that run as POP/SMTP/IMAP proxy servers. These products download the email on behalf of the user, analyze it deciding whether it is spam or legitimate, and tag it accordingly. Examples of these products are POPFile (which features general email classification, apart from spam filtering), SpamBayes (proxy version), K9, and SAwin32 (Fig. 11). As an example, the program SAwin32 is a POP proxy that includes a fully functional port of SpamAssassin for Microsoft Windows PCs. The proxy is configured to access the POP email server from which the user downloads his email, and to check it using the SpamAssassin list, rule, and Bayesian filters. If found to be spam, a message is tagged with a configurable string in the subject (the default is '*** SPAM ***'), and the message is forwarded to the user with an explanation in the body and the original message as an attachment. The explanation presents a digest of the message and describes the tests that have been fired by the message, the spam score of the message, and other administrative data. The proxy also adds some nonstandard headers, beginning with 'X-' (in this example, there are also some Mozilla Thunderbird headers, like 'X-Mozilla-Status'), such as 'X-Spam-Status,' which includes the spam value of the message ('Yes' in this case) and the tests and scores obtained.

FIG. 11. An example of a message scanned by the SAwin32 proxy, implementing the full set of tests and techniques of SpamAssassin.

The user can then configure a filter in his email client, sending to a spam box (a local quarantine) all the messages that arrive tagged as spam. The user can also suppress the additional tag and rely on the spam filter headers. Also, most proxy-type filters include the possibility of not downloading to the email client the messages tagged as spam, keeping a separate quarantine in the proxy program, often accessed as a local web application. The tagging approach can also work at the organization level: the general email server tags the messages for all users, and each one prepares a local filter based on the server-side tags if needed.
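As an illustration of the tagging step, the following minimal Python sketch rewrites a message in the style described above; the header name and subject tag imitate the SpamAssassin convention, while the score and the threshold are assumed to come from the actual filter:

    import email
    from email import policy

    THRESHOLD = 5.0  # assumed filter threshold

    def tag_message(raw_bytes: bytes, score: float) -> bytes:
        """Add an X-Spam-Status header and, if spam, tag the subject line."""
        msg = email.message_from_bytes(raw_bytes, policy=policy.default)
        is_spam = score >= THRESHOLD
        msg["X-Spam-Status"] = ("Yes" if is_spam else "No") + f", score={score:.1f}"
        if is_spam:
            subject = msg["Subject"] or ""
            if "Subject" in msg:
                msg.replace_header("Subject", "*** SPAM *** " + subject)
            else:
                msg["Subject"] = "*** SPAM ***"
        return msg.as_bytes()

The email client can then filter on either the subject tag or the nonstandard header.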

5.4 Best and Future Practical Spam Filtering

As stated in [27], 'the best results in spam filtering could be achieved by a combination of methods: Black lists stop known spammers, graylists eliminate spam sent by spam bots, decoy email boxes alert to new spam attacks, and Bayesian filters detect spam and virus attack emails right from the start, even before the black list is updated.' False positives are still a problem, due to the important effects that losing an important mail can produce, along with the bouncing emails created by viruses [8], an important new kind of spam. Probably, in the future, all these techniques will be mixed with economic and legal antispam measures like computing time-based systems (see, for instance, the HashCash service: http://www.hashcash.org), money-based systems [53], strong antispam laws [15], and other high-impact social measures [80]. Each day, filters apply more and more measures to detect spam, due to its high economic impact.

6. Attacking Spam Filters

6.1 Introduction

As the volume of bulk spam email increases, it becomes more and more important to apply techniques that alleviate the cost that spam implies. Spam filters evolve to better recognize spam, and that has forced the spammers to find new ways to avoid the detection of their spam, ensuring the delivery of their messages. For example, as statistical spam filters began to learn that words like 'Viagra' mostly occur in spam, spammers began to obfuscate them with spaces and other symbols, transforming spam-related words into others like 'V-i-a-g-r-a' that, while conserving all the meaning for a human being, are hardly detected by software programs.

We refer to all the techniques that try to mislead spam filters as attacks, and to the spammers employing these methods as attackers, since their goal is to subvert the normal behavior of the spam filter, allowing a spam message to be delivered as a normal message. A good spam filter should be robust to past, present, and future attacks, but most empirical evaluations of spam filters ignore this, because the spammers' behavior is unpredictable and, therefore, the real effectiveness of a filter cannot be known until after its release. The attacker's point of view is of vital importance when dealing with spam filters, because it gives full knowledge about the possible attacks on spam filters. This perspective also gives a schema of the spammers' way of thinking, which allows predicting possible attacks and detecting tendencies, helping to construct a robust and safe filter. Following this trend, there exist some attempts to compile spammers' attacks, such as Graham-Cumming's Spammer's Compendium [42].

Usually, a spam filter is part of a complex corporate IT architecture, and attackers are capable of dealing with all the parts of that architecture, exploiting all the possible weak points. Accordingly, there exist direct attacks that try to exploit the spam filter's vulnerabilities, and indirect attacks that try to exploit other weak points of the infrastructure. The relevance of indirect attacks has been growing since 2004, as spammers shifted their attacks away from content and focused more on the SMTP connection point [75]. Spammers usually make use of other security-related problems, like viruses and trojans, in order to grow their infrastructure. Nowadays, a spammer uses trojan programs to control a lot of zombie machines, from which he is able to attack many sites while not being easily located, as he is far from the origin of the attacks. That makes it more and more difficult to counteract the spammers' attacks, and increases their side effects.

6.2 Indirect Attacks

Mail servers automatically send mails when certain delivery problems, such as a mail over-quota problem, occur. These automatically sent emails are called bounce messages. Bounce messages can be seen, in a way, as auto replies (like the out-of-office auto replies), but they are not sent by human decision; they are sent automatically by the mail server. All these auto replies are discussed in RFC 3834 (http://www.rfc-editor.org/rfc/rfc3834.txt), which points out that they must be sent to the Return-Path established in the received email that caused the auto reply. The return message must be sent without a Return-Path, in order to avoid an infinite loop of auto replies. Among bounce messages, two kinds are especially important: the NDRs (Non-Delivery Reports), which are a basic function of SMTP and inform that a certain message could not be delivered, and the DSNs (Delivery Status Notifications), which can be explicitly required by means of the ESMTP (SMTP Service Extension) protocol. The NDRs implement the part of the SMTP protocol that appears in RFC 2821 (http://www.ietf.org/rfc/rfc2821.txt): 'If an SMTP server has accepted the task of relaying the mail and later finds that the destination is incorrect or that the mail cannot be delivered for some other reason, then it must construct an "undeliverable mail" notification message and send it to the originator of the undeliverable mail (as indicated by the reverse-path).'

The main goal of a spammer is to achieve the correct delivery of a certain nondesired mail. To achieve this goal, it is sometimes necessary to first achieve a previous target. Indirect attacks are those that use characteristics outside the spam filter in order to achieve a previous target (like obtaining valid email addresses), or that camouflage a certain spam mail as a mail server notification. These indirect attacks use the bounce messages that mail servers send in order to achieve their goals. Three typical indirect attacks are:

1. NDR (Non-Delivery Report) Attack
2. Reverse NDR Attack
3. Directory Harvest Attack

The NDR Attack consists of camouflaging the spam in an email that appears to be an NDR, in order to confuse the user, who may believe that he sent the initial mail that could not be delivered. The curiosity of the user may drive him to open the mail and the attachment where the spam resides. NDR Attacks have two weak points:

1. They make intensive use of the attacker's mail server, which must try to send thousands or millions of NDR-like messages that many times are not even opened by the recipients.
2. If the receiver's mail server uses a black list where the IP of the attacker's mail server is listed, the false NDRs will never reach their destination, and all the effort made by the spammer will be useless.


To defeat these two weak points, the Reverse NDR Attack was devised. In a Reverse NDR Attack, the intended target's email is used as the sender, rather than as the recipient. The recipient is a fictitious email address that uses the domain name of the target's company (for instance, 'example.com'), such as noexist@example.com. The mail server of the Example company cannot deliver the message and sends an NDR mail back to the sender (which is the target email). This return mail carries the NDR and the original spam message attached, and the target may read the NDR and the included spam, thinking they may have sent the email. As can be seen, in this procedure the NDR mail is sent by a reliable mail server that is not in any black list and cannot be easily filtered.

Another attack that exploits bounce messages is the DHA (Directory Harvest Attack), a technique used by spammers to find valid email addresses at a certain domain. The success of a Directory Harvest Attack relies on the recipient email server rejecting, during the Simple Mail Transfer Protocol (SMTP) session, email sent to invalid recipient addresses. Any address for which email is accepted is considered valid and is added to the spammer's list. There are two main techniques for generating the addresses that a DHA will target. In the first one, the spammer creates a list of all possible combinations of letters and numbers up to a maximum length, and then appends the domain name. This is a standard brute force attack, and it implies a heavy workload on both servers. The other technique is a standard dictionary attack, based on the creation of a list combining common first names, surnames, and initials. This second technique usually works well on company domains, where the employees' email addresses are usually created using their real names and surnames, and not nicknames as in free mail services like Hotmail or Gmail.

6.3 Direct Attacks

The first generation of spam filters used rules to recognize specific spam features (like the presence of the word 'Viagra') [46]. Nowadays, as spam evolves quickly, it is impossible to update the rules as fast as new spam variations are created, and a new generation of more adaptable, learning-based spam filters has emerged [46]. Direct attacks are attacks on the heart of those statistical spam filters, and they try to transform a given spam message into a stealthy one. The effectiveness of the attacks relies heavily on the filter type, its configuration, and the previous training (mails received and labeled by the user as spam). One of the simplest attacks is called picospam, and consists of appending random words to a short spam message, hoping that those random words will be recognized as 'good words' by the spam filter. This attack is very simple and has been shown to be ineffective [63], but it illustrates the general spirit of a direct attack on a spam filter. There are many spammer techniques [25, 91], which can be grouped into four main categories [96]:

• Tokenization: These attacks work against the feature extraction used by the filter to obtain the main features of the messages. Examples of tokenization attacks include splitting up words with spaces, dashes, and asterisks, or using HTML, JavaScript, or CSS tricks.
• Obfuscation: With this kind of attack, the message's contents are obscured from the filter using different kinds of encodings, including HTML entity or URL encoding, letter substitution, Base64 printable encoding, and others.
• Statistical: These methods try to skew the message's statistics by adding more good tokens or using fewer bad ones. There exist some variations of this kind of attack, depending on the method used to select the added words: weak statistical or passive attacks, which use random words, and strong statistical or active attacks, which carefully select the words needed to mislead the filter by means of some kind of feedback. Strong statistical attacks are more refined versions of weak attacks, being more difficult to develop, and their practical applicability can be questioned.
• Hiding the text: Some attacks avoid the use of words and insert the spam message as an image, Flash, RTF, or another file format; some other attacks insert a link to a web page where the real spam message resides. The goal is that the user can see the message but the filter cannot extract any relevant feature.

We discuss instances of these attacks in the next sections.

6.3.1 Tokenization Attacks

Statistical spam filters need a previous tokenization stage, in which the original message is transformed into a set of features describing the main characteristics of the email. A typical tokenizer would count the occurrences of the words appearing in the email, decomposing the text at the spaces and other punctuation signs that separate the words. Tokenization attacks are conceived to attack this part of the filter, trying to avoid the correct recognition of spammy words by inserting spaces or other typical word delimiters inside the words. A typical example for avoiding the recognition of the word 'VIAGRA' would be separating its letters in this way: 'V-I.A:G-R_A.' The user would easily recognize the word 'VIAGRA,' but many filters would tokenize it into multiple letters that do not represent the real content of the email.


6.3.1.1 Hypertextus Interruptus. This tokenization attack is based on the idea of splitting the words using HTML comments, pairs of zero-width tags, or bogus tags. As the mail client renders the HTML, the user sees the message as if it did not contain any tag or comment, but spam filters usually separate the words according to the presence of certain tags. Some examples trying to avoid the detection of the word 'VIAGRA' (with illustrative tags and comments):

• VIA<!-- comment -->GRA
• VI<span></span>AGRA
• VIAG<i>R</i>A
• V<xyz>IAGRA
• VIAGR<!-- -->A

A typical tokenizer would decompose each of the last lines into a different set of words:

• VIA, GRA
• VI, AGRA
• VIAG, R, A
• V, IAGRA
• VIAGR, A

This makes it more difficult to learn to distinguish VIAGRA-related spam, as each of the spam messages received contains a different set of features. The only way to face this kind of attack is to parse the messages carefully, ignoring HTML comments and tags, or to extract the features from the output produced by an HTML rendering engine.
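A minimal Python sketch of the first countermeasure, dropping HTML comments and tags before tokenization (a full HTML rendering engine would be more robust), could be:

    import re

    COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
    TAG = re.compile(r"<[^>]+>")

    def strip_html(text: str) -> str:
        """Remove comments first, then any remaining tags."""
        return TAG.sub("", COMMENT.sub("", text))

    print(strip_html("V<xyz>IA<!-- noise -->GRA"))  # prints VIAGRA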

6.3.1.2 Slice and Dice. Slice and dice means to break a certain body of information down into smaller parts and then examine it from different viewpoints. Applied to spam, slice and dice consists of dividing a spam message into text columns and then rewriting the message, putting each text column in a column of an HTML table. Applying the typical VIAGRA example, we could render it using a table like:

  <table><tr><td>VIAG</td><td>RA</td></tr></table>

The user would see only VIAGRA, but the tokenizer would extract a different feature for each fragment in the table, instead of the word itself.


6.3.1.3 Lost in Space. This is the most basic tokenization attack, and it consists of adding spaces or other characters between the letters that compose a word, in order to make it unrecognizable to word parsers. 'V*I*A*G*R*A,' 'V I A G R A,' and 'V.I.A.G.R.A' are typical examples of this simple and common technique. Some spam filters counter it by merging nonsense fragments separated by common characters and checking whether they compose a word that could be considered a spam feature.
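A minimal sketch of that merging heuristic, assuming a small set of common delimiter characters, could be:

    import re

    # a run of three or more single characters joined by common delimiters
    SPLIT_WORD = re.compile(r"\b(?:\w[*.\-_ ]){2,}\w\b")

    def merge_split_words(text: str) -> str:
        """Collapse 'V*I*A*G*R*A'-style runs into a single candidate token."""
        return SPLIT_WORD.sub(lambda m: re.sub(r"[*.\-_ ]", "", m.group()), text)

    print(merge_split_words("Buy V*I*A*G*R*A now"))  # Buy VIAGRA now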

6.3.2 Obfuscation Attacks

In obfuscation attacks, the message's contents are obscured from the filter using different encodings or misdirection, like letter substitution or HTML entities. These attacks affect the spam filter in a way very similar to tokenization attacks: the features extracted from an obfuscated message do not correspond to the standard features of the spam previously received. Obfuscating a word like 'VIAGRA' can be done in many ways:

• 'V1AGRA'
• 'VI4GR4'
• 'VÍAGRA'

All the previous ways to obfuscate the word 'VIAGRA' produce a different feature that would be used by the spam filter to distinguish, or learn how to distinguish, a VIAGRA-related spam message. A prominent form of obfuscation is the utilization of leetspeak. Leetspeak or leet (usually written as l33t or 1337) is an argot used primarily on the Internet, and very common in many online video games, well received by youngsters who use it to obfuscate their mails or SMS, trying to prevent their parents from understanding what they write to their friends. Leetspeak uses various combinations of alphanumeric characters to replace proper letters. Typical replacements are '4' for 'A,' '8' or '13' for 'B,' '(' for 'C,' ')' or '|)' for 'D,' '3' for 'E,' 'ph' for 'F,' '6' for 'G,' '#' for 'H,' '1' or '!' for 'I,' etc. Using these replacements, 'VIAGRA' could be written as 'V14GR4' or 'V!4G2A,' which can be understood by a human being but would be unintelligible to a spam filter. Foreign accent is an attack very similar to leetspeak, but it does not replace letters with alphanumeric characters; it uses accented letters to substitute vowels, or even characters like 'ç' to substitute 'c,' due to their similarity. 'VIAGRA' could be rewritten in a huge set of ways, like 'VÍAGRÁ,' 'VÌAGRÀ,' 'VÏAGRÄ,' etc.

The simplest way to confront these attacks is to undo the replacements, which is very simple for the foreign accent attacks, because there exists a univocal correspondence between an accented letter and the unaccented one. Leetspeak is more difficult to translate: when a certain number or character is found, it must be studied whether that character represents a certain letter or must remain as it is.
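Both countermeasures can be sketched in a few lines of Python: accents are removed through Unicode decomposition (the univocal mapping just mentioned), while a small translation table, given here only as an illustration and far from complete, undoes the most common single-character leet substitutions:

    import unicodedata

    LEET = str.maketrans({"4": "A", "1": "I", "!": "I", "3": "E",
                          "0": "O", "$": "S", "@": "A", "6": "G", "8": "B"})

    def normalize(word: str) -> str:
        # strip accents: 'VÍAGRÁ' -> 'VIAGRA'
        decomposed = unicodedata.normalize("NFKD", word)
        plain = "".join(c for c in decomposed if not unicodedata.combining(c))
        # undo single-character leet replacements: 'V14GR4' -> 'VIAGRA'
        return plain.upper().translate(LEET)

    print(normalize("V14GR4"), normalize("VÍAGRÁ"))  # VIAGRA VIAGRA

Multi-character replacements (like '13' for 'B' or 'ph' for 'F') would require the contextual analysis described above.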

6.3.3 Statistical Attacks

While tokenization and obfuscation attacks are more related to the preprocessing stage of a spam filter, statistical attacks have the main goal of attacking the heart of the statistical filter. Statistical attacks, more often called Bayesian poisoning [27, 37, 88] (as most of the classifiers used to detect spam are Bayesian) or good word attacks [63], are based on adding random, or even carefully selected, words that are unlikely to appear in spam messages and are supposed to cause the spam filter to believe the message is not spam (a statistical type II error). Statistical attacks have a secondary effect, a higher false-positive rate (a statistical type I error), because when the user trains the spam filter with spam messages containing normal words, the filter learns that these normal words are a good indication of spam. Statistical attacks are all very similar; what most varies among them is the way the words are selected and the way they are added to the message. According to the word selection, there exist active attacks and passive attacks. According to the way the words are inserted in the email, there exist some variations, like invisible ink and MIME-based attacks. Attacks can combine approaches: for each possible variation of including the words in the message, the attack can be active or passive.

6.3.3.1 Passive Attacks. In passive attacks, the attacker constructs the list of good words to be added to the spam messages without any feedback from the spam filter. Claburn [16] explains this process as follows: 'The automata will just keep selecting random words from the legit dictionary . . . When it reaches a Bayesian filtering system, [the filtering system] looks at these legitimate words and the probability that these words are associated with a spam message is really low. And the program will classify this as legitimate mail.' The simplest passive attack consists of selecting a random set of words and adding it to all the spam messages sent by the attacker. If the same words are added to every spam message sent, that set of words will end up being considered a good indicator of spam, and the attack becomes unproductive. Another simple yet more effective attack selects a fresh random set of words for each spam mail sent, or for each batch of spam messages. Wittel and Wu [96] show that the addition of random words was ineffective against the filter CRM-114 but effective against SpamBayes.
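To see why appended 'good' words sway a statistical filter, consider this toy naive Bayes log-odds computation; all token probabilities are hand-picked for illustration, not estimated from any real corpus:

```python
import math

# Toy token probabilities P(token|spam) and P(token|ham); illustrative only.
P_SPAM = {'viagra': 0.90, 'cheap': 0.60, 'meeting': 0.05, 'weather': 0.04}
P_HAM  = {'viagra': 0.01, 'cheap': 0.20, 'meeting': 0.50, 'weather': 0.40}

def spam_log_odds(tokens, prior_spam=0.5):
    """Naive Bayes log-odds that a message is spam; > 0 means 'spam'."""
    score = math.log(prior_spam / (1 - prior_spam))
    for t in tokens:
        if t in P_SPAM:
            score += math.log(P_SPAM[t] / P_HAM[t])
    return score

original = ['viagra', 'cheap']
poisoned = original + ['meeting', 'weather']   # appended 'good' words

print(round(spam_log_odds(original), 2))   # about 5.60: clearly spam
print(round(spam_log_odds(poisoned), 2))   # about 0.99: pushed toward ham
```

With enough hammy words appended, the score turns negative and the message slips through; if the user then trains on it, those hammy words start counting toward spam, raising the false-positive rate as described above.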


A smarter attack can be achieved by selecting common or hammy words instead of performing a random selection. Wittel and Wu [96] show that attacks using common words are more effective against SpamBayes, even when adding fewer words than in the random selection case. In turn, the work in [88] shows that adding common words makes the filter's precision decrease from 84% to 67% and from 94% to 84%, and proposes ignoring common words when performing the spam classification to avoid this performance drop.

6.3.3.2 Active Attacks. In active attacks, the attacker receives some kind of feedback about whether the spam filter labels a message as spam or not. Graham-Cumming [43] presents a simple way of getting this feedback by including a unique web bug in each message sent. A web bug is a graphic on a web page or inside an email that is designed to monitor who is reading a certain page or mail, by counting the hits or accesses to the graphic. Having one web bug per word, or per word set, allows the spammer to learn which words make a spam message look like a ham message to a certain spam filter. Lowd and Meek [63] show that passive attacks adding random words to spam messages are an ineffective form of attack, and also demonstrate that adding hammy words is very effective against naïve Bayesian filters. The same work [63] describes in detail two active attacks that are very effective against most typical spam filters. The best way of preventing active attacks is to close any door that allows the spammer to receive feedback from our system, such as non-delivery reports, SMTP-level errors, or web bugs.

6.3.3.3 Invisible Ink. This statistical attack consists of adding some real random words to the message without letting the user see them. There are several variants:

• Add the random words before the HTML.
• Add an email header packed with the random words.
• Write colored text on a background of the same color.

To defeat this spammer practice, spam filters should work only with the data the user can see, ignoring headers and text colored the same as its background (Fig. 12).
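A rough sketch of this 'visible text only' idea, assuming for simplicity a white page background and regex-based HTML handling (a real filter would need full HTML/CSS parsing with style inheritance):

```python
import re

# Not a full HTML/CSS engine: keep only the text a user would actually see.
# Assumes a white background, so spans whose text color is also white are
# invisible to the user and are dropped before tokenization.
INVISIBLE_SPAN = re.compile(
    r'<(\w+)[^>]*style="[^"]*color:\s*#fff[^"]*"[^>]*>.*?</\1>',
    re.I | re.S)
TAGS = re.compile(r'<[^>]+>')

def visible_text(html: str) -> str:
    html = INVISIBLE_SPAN.sub(' ', html)          # drop invisible spans
    return ' '.join(TAGS.sub(' ', html).split())  # strip remaining tags

html = ('<body>Buy pills now '
        '<span style="color:#fff">meeting weather lunch report</span>'
        '</body>')
print(visible_text(html))  # -> 'Buy pills now'
```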

FIG. 12. An example of image-based spam.

6.3.4 Hidden Text Attacks
These kinds of attacks try to show the user a certain spam message while preventing the spam filter from capturing any feature from its content. The key point is to place the spam message inside an image, or inside another format such as RTF or PDF, that the email client will show to the user embedded into the message. We describe several kinds of attacks that avoid using text, or that disguise it in formats hard to process effectively and efficiently.

6.3.4.1 MIME Encoding. MIME (Multipurpose Internet Mail Extensions) is an Internet standard that allows email to support plain text, non-text attachments, multi-part bodies, and non-ASCII header information. In this attack, the spammer sends a MIME document with the spam message in the HTML section and some innocuous text in the plain text section, which makes the spam filter's work more difficult.
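One possible countermeasure, sketched here under strong simplifications (a toy message and regex-based tag stripping), is to compare the vocabularies of the two alternative MIME parts and treat a very low overlap as suspicious:

```python
import email
import re
from email.policy import default

# A toy multipart/alternative message: benign text/plain part, spammy HTML.
RAW = """\
From: spammer@example.com
To: victim@example.com
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="B"

--B
Content-Type: text/plain

Hi, see you at the meeting tomorrow. Regards.
--B
Content-Type: text/html

<html><body>Buy cheap pills now! Unbeatable offer!</body></html>
--B--
"""

def tokens(text):
    return set(re.findall(r'[a-z]+', text.lower()))

msg = email.message_from_string(RAW, policy=default)
plain = msg.get_body(preferencelist=('plain',)).get_content()
html = msg.get_body(preferencelist=('html',)).get_content()
html_text = re.sub(r'<[^>]+>', ' ', html)  # crude tag stripping

a, b = tokens(plain), tokens(html_text)
overlap = len(a & b) / len(a | b)  # Jaccard similarity of the two parts
print(f'part overlap = {overlap:.2f}')  # near 0 here -> suspicious
```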

6.3.4.2 Script Hides the Contents. Most filters that parse HTML to extract features skip the information inside SCRIPT tags, as scripts usually contain nothing useful for detecting spam content. Recently, some attackers have made use of script languages to change the contents of the message on a mouse-over event, or when the email is opened. Using this technique, the spam filter reads a normal mail that is converted into a spam mail when opened by the email client.
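Since legitimate email rarely needs client-side scripting, a filter can simply flag (or strip) scripted content so that it classifies exactly what the user will see. A rough heuristic sketch:

```python
import re

# Heuristic only: the mere presence of scripts or event handlers in an
# HTML email body is treated as a spam signal.
SCRIPT_PATTERNS = [
    re.compile(r'<script\b', re.I),
    re.compile(r'\bon(load|mouseover|click)\s*=', re.I),
]

def looks_scripted(html: str) -> bool:
    return any(p.search(html) for p in SCRIPT_PATTERNS)

print(looks_scripted('<body onload="rewrite()">innocent text</body>'))  # True
```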

6.3.4.3 Image-Based Spam. The first image-based spam was reported by Graham-Cumming in 2003 and was very simple: it included only an image with the spam text inside. At that time, images were loaded from Web sites using simple HTML image tags, but as email clients started to block remote image loading, spammers started to send the images as MIME attachments. The first attempts to detect this kind of spam led developers to use OCR (Optical Character Recognition) techniques to pull words out of the images and use them as features to classify the message as spam or ham.


But OCR is computationally expensive, its accuracy leaves much to be desired, and it can easily be misled by adding noise to the images, or by obfuscating their contents with the same techniques used in text spam. According to IronPort statistics [49], 1% of spam was image based in June 2005; one year later, the figure had risen to 16%. Cosoi [23] shows that image-based spam grew from 5% in March 2006 to almost 40% at the end of 2006, and Samosseiko and Thomas [83] assert that more than 40% of the spam seen in SophosLabs at the end of 2006 was image-based spam. All these numbers show the real importance of image-based spam today (Fig. 13).

FIG. 13. An example of image-based spam with noise in order to avoid text recognition using OCR software.

For a more detailed review of current image-based techniques, Graham-Cumming [45] shows the evolution of image spam over the last years: from single images containing the whole spam, to more or less complicated combinations of different images that together show the spam to the user, to integrating images with tokenization and obfuscation attacks, using strange fonts, adding random pixels to avoid OCR, hashing, etc. Interestingly, a number of researchers have devised approaches to deal with specific forms of image spam. In particular, Biggio and others [10] have proposed to detect spam email by exploiting the very fact that spammers add noise to images in order to avoid OCR, what the authors call 'obscuring.' In a more general approach, Byun and others make use of a suite of synthesized attributes of spam images (color moment, color heterogeneity, conspicuousness, and self-similarity) in order to characterize a variety of image spam types [12]. These are promising lines of research that, combined with other techniques, offer the possibility of highly accurate filtering.
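A minimal sketch of the OCR-based approach mentioned above, using the open-source Tesseract engine through the pytesseract wrapper (both are assumed to be installed, and the attachment file name is hypothetical); as noted, noisy images defeat this easily:

```python
from PIL import Image
import pytesseract

def image_tokens(path: str):
    """Pull words out of an attached image so they can be fed to the same
    text classifier used for regular message bodies."""
    text = pytesseract.image_to_string(Image.open(path))
    return text.lower().split()

# Hypothetical attachment extracted from a MIME message.
print(image_tokens('attachment.gif'))  # e.g. ['buy', 'cheap', 'pills', ...]
```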

6.3.4.4 Spam Using Other Formats. As spam filter developers increase the precision of their software, spammers develop new variations of their attacks. One simple variation of the image-based attack consists of using a compressed PDF instead of an image to hold the spam content, trying to avoid OCR scanning, since PDFs were not expected to contain spam. Another attack consists of embedding RTF files containing the spam message, which some email clients, notably Microsoft's, detect and render inline.

7. Conclusions and Future Trends

Spam is an ever-growing menace that can be very harmful. Its effects can be very similar to those produced by a denial-of-service (DoS) attack. Political, economical, legal, and technical measures are not enough to end the problem on their own; only a combination of all of them can lower the harm it produces. Among all these approaches, content-based filters have been the most effective solution, with such an impact that spammers have had to search for new ways to get past them. Luckily, systems based on Machine Learning algorithms can learn and adapt to new threats, reacting to the countermeasures used by spammers. Recent competitions in spam filtering have shown that current systems can filter out most of the spam, and that new approaches, like those based on compression, can achieve high accuracy ratios. Spammers have designed new and refined attacks that hit one of the critical steps in every learning method, the tokenization process, but compression-based filters have proven very resistant to this kind of attack.

References
[1] Abadi M., Birrell A., Burrows M., Dabek F., and Wobber T., December 2003. Bankable postage for network services. In Proceedings of the 8th Asian Computing Science Conference, Mumbai, India.
[2] Ahn L. V., Blum M., and Langford J., February 2004. How lazy cryptographers do AI. Communications of the ACM.
[3] Anderson R., 2004. Taking a bit out of spam. Network Computing, Magazine Article, May 13.
[4] Androutsopoulos I., Koutsias J., Chandrinos K. V., Paliouras G., and Spyropoulos C. D., 2000. An evaluation of Naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML), pp. 9–17. Barcelona, Spain.

[5] Androutsopoulos I., Paliouras G., Karkaletsis V., Sakkis G., Spyropoulos C. D., and Stamatopoulos P., 2000. Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 1–13. Lyon, France.
[6] Androutsopoulos I., Koutsias J., Chandrinos K. V., and Spyropoulos C. D., 2000. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160–167. Athens, Greece. ACM Press, New York, US.
[7] Belkin N. J., and Croft W. B., 1992. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12): 29–38.
[8] Bell S., 2003. Filters causing rash of false positives: TelstraClear's new virus and spam screening service gets mixed reviews. http://computerworld.co.nz/news.nsf/news/CC256CED0016AD1ECC256DAC000D90D4?Opendocument.
[9] Bickel S., September 2006. ECML/PKDD discovery challenge 2006 overview. In Proceedings of the Discovery Challenge Workshop, 17th European Conference on Machine Learning (ECML) and 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany.
[10] Biggio B., Fumera G., Pillai I., and Roli F., August 2007. Image spam filtering by content obscuring detection. In Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), pp. 2–3. Microsoft Research Silicon Valley, Mountain View, California.
[11] Bratko A., Cormack G. V., Filipic B., Lynam T. R., and Zupan B., December 2006. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7: 2699–2720.
[12] Byun B., Lee C.-H., Webb S., and Pu C., August 2007. A discriminative classifier learning approach to image modeling and spam image identification. In Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS 2007), Microsoft Research Silicon Valley, Mountain View, California.
[13] Caropreso M. F., Matwin S., and Sebastiani F., 2001. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization. In Text Databases and Document Management: Theory and Practice, A. G. Chin, ed., pp. 78–102. Idea Group Publishing, Hershey, US.
[14] Carreras X., and Márquez L., 2001. Boosting trees for anti-spam email filtering. In Proceedings of RANLP-2001, 4th International Conference on Recent Advances in Natural Language Processing.
[15] Caruso J., December 2003. Anti-spam law just a start, panel says. Network World, http://www.networkworld.com/news/2003/1218panel.html.
[16] Claburn T., 2005. Constant struggle: How spammers keep ahead of technology. Message Pipeline, http://www.informationweek.com/software/messaging/57702892.
[17] Cleary J. G., and Teahan W. J., 1997. Unbounded length contexts for PPM. The Computer Journal, 40(2/3): 67–75.
[18] Cohen W. W., and Hirsh H., 1998. Joins that generalize: Text classification using WHIRL. In Proceedings of KDD-98, 4th International Conference on Knowledge Discovery and Data Mining, pp. 169–173. New York, NY.
[19] Cormack G. V., and Lynam T. R., 2005. TREC 2005 spam track overview. In Proc. TREC 2005 – The Fourteenth Text REtrieval Conference, Gaithersburg.
[20] Cormack G. V., and Lynam T. R., July 2005. Spam corpus creation for TREC. In Proc. CEAS 2005 – The Second Conference on Email and Anti-Spam, Palo Alto.

[21] Cormack G. V., and Bratko A., July 2006. Batch and on-line spam filter evaluation. In CEAS 2006 – Third Conference on Email and Anti-Spam, Mountain View.
[22] Cormack G., Gómez Hidalgo J. M., and Puertas Sanz E., November 2007. Spam filtering for short messages. In ACM Sixteenth Conference on Information and Knowledge Management (CIKM 2007), Lisboa, Portugal.
[23] Cosoi C. A., December 2006. The medium or the message? Dealing with image spam. Virus Bulletin, http://www.virusbtn.com/spambulletin/archive/2006/12/sb200612-image-spam.dkb.
[24] Cranor L. F., and LaMacchia B. A., 1998. Spam! Communications of the ACM, 41(8): 74–83.
[25] Dalvi N., Domingos P., Mausam, Sanghai S., and Verma D., 2004. Adversarial classification. In Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, pp. 99–108. ACM Press, Seattle, WA.
[26] Dantin U., and Paynter J., 2005. Spam in email inboxes. In 18th Annual Conference of the National Advisory Committee on Computing Qualifications, Tauranga, New Zealand.
[27] Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., and Harshman R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391–407.
[28] Domingos P., and Pazzani M. J., 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3): 103–130.
[29] Domingos P., 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. San Diego, CA, ACM Press.
[30] Drucker H., Vapnik V., and Wu D., 1999. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5): 1048–1054.
[31] Dumais S. T., Platt J., Heckerman D., and Sahami M., 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management, G. Gardarin, J. C. French, N. Pissinou, K. Makki, and L. Bouganim, eds., pp. 148–155. ACM Press, New York, US, Bethesda, US.
[32] Dwork C., Goldberg A., and Naor M., August 2003. On memory-bound functions for fighting spam. In Proceedings of the 23rd Annual International Cryptology Conference (CRYPTO 2003).
[33] Eckelberry A., 2006. What is the effect of Bayesian poisoning? Security Pro Portal, http://www.netscape.com/viewstory/2006/08/21/what-is-the-effect-of-bayesian-poisoning/.
[34] Fawcett T., 2003. 'In vivo' spam filtering: A challenge problem for KDD. SIGKDD Explorations, 5(2): 140–148.
[35] Fuhr N., Hartmann S., Knorz G., Lustig G., Schwantner M., and Tzeras K., 1991. AIR/X—a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO-91, 3rd International Conference 'Recherche d'Information Assistée par Ordinateur,' pp. 606–623. Barcelona, Spain.
[36] Garcia F. D., Hoepman J.-H., and van Nieuwenhuizen J., 2004. Spam filter analysis. In Security and Protection in Information Processing Systems, IFIP TC11 19th International Information Security Conference (SEC2004), Y. Deswarte, F. Cuppens, S. Jajodia, and L. Wang, eds., pp. 395–410. Toulouse, France.
[37] Gee K., and Cook D. J., 2003. Using latent semantic indexing to filter spam. In ACM Symposium on Applied Computing, Data Mining Track.
[38] Gómez-Hidalgo J. M., Maña-López M., and Puertas-Sanz E., 2000. Combining text and heuristics for cost-sensitive spam filtering. In Proceedings of the Fourth Computational Natural Language Learning Workshop, CoNLL-2000, Association for Computational Linguistics, Lisbon, Portugal.
[39] Gómez-Hidalgo J. M., 2002. Evaluating cost-sensitive unsolicited bulk email categorization. In Proceedings of SAC-02, 17th ACM Symposium on Applied Computing, pp. 615–620. Madrid, ES.

[40] Gómez-Hidalgo J. M., Maña-López M., and Puertas-Sanz E., 2002. Evaluating cost-sensitive unsolicited bulk email categorization. In Proceedings of JADT-02, 6th International Conference on the Statistical Analysis of Textual Data, Madrid, ES.
[41] Goodman J., 2004. IP addresses in email clients. In Proceedings of the First Conference on Email and Anti-Spam.
[42] Graham-Cumming J., 2003. The Spammer's Compendium. In MIT Spam Conference.
[43] Graham-Cumming J., 2004. How to beat an adaptive spam filter. In MIT Spam Conference.
[44] Graham-Cumming J., February 2006. Does Bayesian poisoning exist? Virus Bulletin.
[45] Graham-Cumming J., November 2006. The rise and rise of image-based spam. Virus Bulletin.
[46] Graham P., 2002. A plan for spam. Reprinted in Paul Graham, Hackers and Painters: Big Ideas from the Computer Age, O'Reilly (2004). Available: http://www.paulgraham.com/spam.html.
[47] Graham P., January 2003. Better Bayesian filtering. In Proceedings of the 2003 Spam Conference. Available: http://www.paulgraham.com/better.html.
[48] Gray A., and Haahr M., 2004. Personalised, collaborative spam filtering. In Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[49] Hahn J., 2006. Image-based spam makes a comeback. Web Trends, http://www.dmconfidential.com/blogs/column/Web_Trends/916/.
[50] Hall R. J., March 1998. How to avoid unwanted email. Communications of the ACM.
[51] Hird S., 2002. Technical solutions for controlling spam. In Proceedings of AUUG2002, Melbourne.
[52] Hovold J., 2005. Naive Bayes spam filtering using word-position-based attributes. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS), Stanford University.
[53] InfoWorld Test Center, 2004. Strong spam combatants: Brute anti-spam force takes on false-positive savvy. Issue 22, May 31, 2004.
[54] Friedl J. E. F., August 2006. Mastering Regular Expressions. 3rd edn. O'Reilly.
[55] Joachims T., 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, pp. 137–142. Chemnitz, Germany.
[56] Joachims T., 1999. Transductive inference for text classification using support vector machines. In Proceedings of ICML-99, 16th International Conference on Machine Learning, pp. 200–209. Bled, Slovenia.
[57] Keogh E., Lonardi S., and Ratanamahatana C. A., 2004. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 206–215. Seattle, WA, USA, August 22–25, 2004. ACM, New York, NY.
[58] Kolcz A., Chowdhury A., and Alspector J., 2004. The impact of feature selection on signature-driven spam detection. In Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[59] Larkey L. S., and Croft W. B., 1996. Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, eds., pp. 289–297. ACM Press, New York, US, Zurich, CH.
[60] Lewis D. D., and Gale W. A., 1994. A sequential algorithm for training text classifiers. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, W. B. Croft and C. J. van Rijsbergen, eds., pp. 3–12. Springer Verlag, Heidelberg, DE, Dublin, IE.
[61] Lewis D. D., 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of ECML-98, 10th European Conference on Machine Learning, C. Nédellec and C. Rouveirol, eds., pp. 4–15. Springer Verlag, Heidelberg, DE, Chemnitz, DE. Lecture Notes in Computer Science, 1398.
[62] Li Y. H., and Jain A. K., 1998. Classification of text documents. The Computer Journal, 41(8): 537–546.
[63] Lowd D., and Meek C., 2005. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM Press, Chicago, IL.
[64] Lucas M. W., 2006. PGP & GPG: Email for the Practical Paranoid. No Starch Press.
[65] McCallum A., and Nigam K., 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization.
[66] MessageLabs, 2006. MessageLabs Intelligence: 2006 Annual Security Report. Available: http://www.messagelabs.com/mlireport/2006_annual_security_report_5.pdf.
[67] Meyer T. A., and Whateley B., 2004. SpamBayes: Effective open-source, Bayesian based, email classification system. In Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[68] Mitchell T. M., 1996. Machine Learning. McGraw Hill, New York, US.
[69] O'Brien C., and Vogel C., September 2003. Spam filters: Bayes vs. chi-squared; letters vs. words. In Proceedings of the International Symposium on Information and Communication Technologies.
[70] Pampapathi R., Mirkin B., and Levene M., 2006. A suffix tree approach to anti-spam email filtering. Machine Learning, 65(1): 309–338.
[71] Pantel P., and Lin D., 1998. SpamCop: A spam classification and organization program. In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin. AAAI Technical Report WS-98-05.
[72] Platt J., 1998. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, eds., Advances in Kernel Methods – Support Vector Learning.
[73] Postini White Paper, 2004. Why content filtering is no longer enough: Fighting the battle against spam before it can reach your network. Postini Pre-emptive Email Protection.
[74] Provost F., and Fawcett T., 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining.
[75] Provost J., 1999. Naive-Bayes vs. rule-learning in classification of email. Technical report, Department of Computer Sciences, University of Texas at Austin.
[76] Quinlan R., 1986. Induction of decision trees. Machine Learning, 1(1): 81–106.
[77] Rigoutsos I., and Huynh T., 2004. Chung-Kwei: A pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[78] Saarinen J., 2003. Spammer ducks for cover as details published on the web. NZ Herald, http://www.nzherald.co.nz/section/1/story.cfm?c_id=1&objectid=3518682.
[79] Sahami M., Dumais S., Heckerman D., and Horvitz E., 1998. A Bayesian approach to filtering junk e-mail. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, AAAI Press, Madison, WI.
[80] Sakkis G., Androutsopoulos I., Paliouras G., Karkaletsis V., Spyropoulos C. D., and Stamatopoulos P., 2001. Stacking classifiers for anti-spam filtering of e-mail. In Proceedings of EMNLP-01, 6th Conference on Empirical Methods in Natural Language Processing, Pittsburgh, US. Association for Computational Linguistics, Morristown, US.
[81] Salton G., 1981. A blueprint for automatic indexing. SIGIR Forum, 16(2): 22–38.
[82] Salton G., and McGill M. J., 1983. Introduction to Modern Information Retrieval. McGraw Hill, New York, US.

[83] Samosseiko D., and Thomas R., 2006. The game goes on: An analysis of modern spam techniques. In Proceedings of the 16th Virus Bulletin International Conference.
[84] Sculley D., and Brodley C. E., 2006. Compression and machine learning: A new perspective on feature space vectors. In Data Compression Conference (DCC'06), pp. 332–341.
[85] Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1–47.
[86] Seigneur J.-M., and Jensen C. D., 2004. Privacy recovery with disposable email addresses. IEEE Security and Privacy, 1(6): 35–39.
[87] Sergeant M., 2003. Internet-level spam detection and SpamAssassin 2.50. In Spam Conference.
[88] Stern H., Mason J., and Shepherd M., March 2004. A linguistics-based attack on personalized statistical e-mail classifiers. Technical report CS-2004-06, Faculty of Computer Science, Dalhousie University, Canada.
[89] Taylor B., 2006. Sender reputation in a large webmail service. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), Mountain View, California.
[90] Teahan W. J., and Harper D. J., 2003. Using compression based language models for text categorization. In Language Modeling for Information Retrieval, W. B. Croft and J. Lafferty, eds. The Kluwer International Series on Information Retrieval, Kluwer Academic Publishers.
[91] Van Dinter T., 2004. New and upcoming features in SpamAssassin v3. ApacheCon.
[92] Thomas R., and Samosseiko D., October 2006. The game goes on: An analysis of modern spam techniques. In Virus Bulletin Conference.
[93] Turing A. M., 1950. Computing machinery and intelligence. Mind, 59: 433–460.
[94] Watson B., 2004. Beyond identity: Addressing problems that persist in an electronic mail system with reliable sender identification. In Proceedings of the First Conference on Email and Anti-Spam (CEAS), Mountain View, CA.
[95] Wiener E. D., Pedersen J. O., and Weigend A. S., 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 317–332. Las Vegas, US.
[96] Wittel G. L., and Wu F., 2004. On attacking statistical spam filters. In Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[97] Witten I. H., and Frank E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, Los Altos, US.
[98] Yang Y., and Chute C. G., 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3): 252–277.
[99] Yang Y., and Pedersen J. O., 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning, D. H. Fisher, ed.
[100] Yerazunis B., 2003. Sparse binary polynomial hash message filtering and the CRM114 discriminator. In Proceedings of the Spam Conference.
[101] Yerazunis B., 2004. The plateau at 99.9. In Proceedings of the Spam Conference.
[102] Zdziarski J., 2004. Advanced language classification using chained tokens. In Proceedings of the Spam Conference.
[103] Zimmermann P. R., 1995. The Official PGP User's Guide. MIT Press.

ARTÍCULOS

Artículo 2: Web Content Filtering
Puertas Sanz, E., Gómez Hidalgo, J. M., Carrero, F., Buenaga, M. (2009). Web Content Filtering. Advances in Computers – Elsevier Academic Press (Vol. 76, pp. 257–306).

Impact
This journal article was published by Elsevier. It is indexed in reference scientific indexes such as the Journal Citation Reports (JCR), in the fourth quartile of the category 'Computer Science, Software Engineering,' and the Scimago Journal Rank (SJR), in the second quartile of the category 'Computer Science.'

Abstract
The popularity of the World Wide Web, together with the democratic nature of the Internet (since any user can contribute their own content), makes it prone to rights violations. The Web can be abused in two main ways: by publishing inappropriate content, or by accessing inappropriate content. As a consequence, Internet filters and monitoring software have emerged as the ideal tools to prevent the indiscriminate use of the Internet in the workplace, and to help parents and public institutions keep children from accessing inappropriate content. These software tools focus on blocking access to certain types of Web content, as well as on monitoring the Internet user's browsing for later auditing, should it be considered necessary. Web filters have thus evolved to include more sophisticated and effective techniques, from intelligent text analysis to image processing, covering not only incoming information (e.g., Web content) but also outgoing information (industrial secrets, credit card numbers, etc.), and analyzing a wider range of content types and protocols (instant messaging, P2P, specific online games, etc.). This article addresses the issues surrounding Web content-filtering software, from applications to techniques (focusing on intelligent content analysis), implementation details, and countermeasures against attacks. We also examine how to evaluate filters, both from an industrial and from a scientific point of view.

Contributions
While the previous article described how the effectiveness of spam filters can be improved through Feature Engineering, in this article we propose that the same approach can also be carried over to the Web filtering task in order to improve it. This is a task in which, traditionally, little emphasis was placed on this phase of the learning process, and in this article we show that it is precisely one of its weak points, since it is where most attacks are aimed. We therefore propose the use of Language Engineering techniques that improve the representation of documents, and we evaluate this with a set of learning algorithms to verify that the results do indeed improve.


Once again, we argue that this kind of task must be evaluated using suitable metrics, such as the previously presented ROCCH, since in these tasks the costs are likewise asymmetric and unknown.


Web Content Filtering

JOSÉ MARÍA GÓMEZ HIDALGO
Optenet, Departamento de I+D, C/ José Echegaray 8, edificio 3, Parque Empresarial Alvia, Las Rozas, 28230 Madrid, Spain

ENRIQUE PUERTAS SANZ
Universidad Europea de Madrid, Villaviciosa de Odón, 28670 Madrid, Spain

FRANCISCO CARRERO GARCÍA
Universidad Europea de Madrid, Villaviciosa de Odón, 28670 Madrid, Spain

MANUEL DE BUENAGA RODRÍGUEZ
Universidad Europea de Madrid, Villaviciosa de Odón, 28670 Madrid, Spain

Abstract
Over the years, the Internet has evolved from an academic network into a true communication medium, reaching impressive levels of audience and becoming a billion-dollar business. Many of our working, studying, and entertainment activities are nowadays overwhelmingly limited if we get disconnected from the net of networks. And of course, with use comes abuse. The World Wide Web features a wide variety of content that is harmful for children or just inappropriate in the workplace. Web filtering and monitoring systems have emerged as valuable tools for the enforcement of suitable usage policies. These systems are routinely deployed in corporate, library, and school networks, and contribute to detecting and limiting Internet abuse. Their techniques are increasingly sophisticated and effective, and their development is contributing to the advance of the state of the art in a number of research fields, like text analysis and image processing. In this chapter, we review the main issues regarding Web content filtering, including its motivation, the main operational concerns and techniques used in filtering tools' development, their evaluation and security, and a number of singular projects in this field.

1. Introduction
2. Motivation and Applications
   2.1. Controlling Internet Abuse at the Workplace
   2.2. Children Protection
   2.3. Internet Filtering and Free Speech
3. Web Filters Operation and Techniques
   3.1. Operational Issues
   3.2. Filtering Techniques
4. Text-Based Filtering
   4.1. Text Classification Tasks
   4.2. Text Categorization Types
   4.3. Text Classification Process
   4.4. Web Content Filtering as Text Categorization
5. Image Processing Techniques
   5.1. Adult Image Recognition Using Skin Detection
   5.2. Adult Image Recognition Using Wavelets
6. Evaluation of Web Filters
   6.1. Industrial Evaluation
   6.2. Scientific Evaluation
7. Attacks and Countermeasures
   7.1. Disguising and Wrong Self-Labeling
   7.2. Proxies
   7.3. Anonymization Networks
8. Review of Singular Projects
   8.1. Wavelet Image Pornography Elimination
   8.2. Public Open-Source Environment for Safer Internet Access
   8.3. NetProtect I and II
9. Conclusions and Future Trends
References


1. Introduction
The Internet emerged as an important communication tool for researchers in the academic world but, since the emergence of the World Wide Web, it has quickly evolved into an extremely valuable tool for work and business, study, and entertainment. Many workers are actively engaged in tasks that involve Web access, including marketing research and customer care, dealing with providers, and buying and selling products; even traditional tasks such as human resource management and enterprise resource planning can be performed through Web applications. Students often use the Web as a primary research tool, stay connected to their teachers and other students, and use online learning applications. And in general, the Web is a first-order tool that allows its users to keep in touch with their family, friends, and colleagues through social networks like MySpace, to find and buy products like music or movies, to book travel, to look for new jobs, to play online games, etc. [57].
The popularity of the World Wide Web, and the democratic nature of the Internet (as any user can post their own content to it), make it prone to abuse. The Web can mainly be abused in two ways: through posting inappropriate content, or through accessing inappropriate content. This chapter primarily deals with the second type of abuse.
Most Internet users in democratic countries would agree that access to some types of Web content is inappropriate depending on the time and place. A foremost example is the workplace. Workers often employ their workplace Internet access to visit inappropriate sites: porn, gambling, job search, online games, entertainment, etc., producing important economic losses for their corporations [40, 62]. Leaving the legality of some types of content apart, it is clear that accessing them at the workplace is an abuse, as the employer provides Internet access to workers as a work tool. Access to these sites may be legitimate when done at home or in cybercafés, but not at the workplace.
Additionally, Web site visitors are rarely identified, and most often the age of the visitor is simply ignored. Some Web content may be suitable for adults, but it is served without control to children and youngsters. Examples include pornography, online gambling, dating systems, etc. While legal (depending on the country), these contents are simply not appropriate for children. Moreover, there is evidence of pathological addiction to these contents even among adults [5].
The Internet has also emerged as a means of distributing illegal (or barely legal) content, like child pornography, violence, racism and sects, Web pages promoting anorexia and bulimia, software cracks, etc. All Web users should be protected against these kinds of content, but special attention must be paid to children. There are regulatory efforts that aim at protecting children in public institutions like schools and libraries, enforcing the utilization of tools for preventing children's access to inappropriate content. However, any regulatory approach to children protection and the limitation of illegal content is hindered by the international nature of the Internet, and it is limited by the necessity of protecting free speech in the media.
As a consequence, Internet filters and monitoring software have emerged as tools for avoiding Internet abuse at the workplace, and for aiding parents and public institutions in preventing children's access to unsuitable contents. The goal of these software tools is to disallow access to some kinds of Web content, and to monitor the browsing activity of Internet users for further inspection if needed. Today, Internet filters are part of many Internet security tools for perimeter protection, next to antivirus, antispam, or firewall software, and are routinely deployed at corporations and educational institutions by system administrators. Most often, their deployment is enforced by law, or simply accepted by the workers as part of an agreed acceptable Internet usage policy [55].
Web filters started as simple tools able to detect, and forbid or monitor, access to Web sites listed in URL databases, or to Web pages containing a limited number of keywords (a minimal sketch of this early approach is given at the end of this section). As the number of Web pages and sites is always increasing, and the lists and keywords must be manually managed, the URL- and keyword-matching approaches are of limited effectiveness. Web filters have thus evolved to include more sophisticated and effective techniques, ranging from intelligent text analysis to image processing; to cover not only incoming information (e.g., Web content) but also outgoing information (business secrets, credit card numbers, etc.); and to check a wider range of content types and protocols (instant messaging, peer to peer, specific online games, etc.) [28]. With this evolution, filters have even promoted important developments and innovations in some research fields like image processing (running from [19] to more recent works like [36]).
This chapter aims at covering most issues regarding Web content-filtering software, from applications to techniques (with a focus on intelligent content analysis), implementation details, and attacks and countermeasures.
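A minimal sketch of the early URL-list and keyword-matching approach described above (all host names and keywords here are illustrative):

```python
from urllib.parse import urlparse

# Illustrative, manually managed lists; the limitation discussed above.
BLOCKED_SITES = {'casino.example', 'adult.example'}
BLOCKED_KEYWORDS = {'poker', 'xxx'}

def allow(url: str, page_text: str) -> bool:
    """Return False if the URL's host is blacklisted or the page text
    contains any forbidden keyword."""
    host = urlparse(url).hostname or ''
    if host in BLOCKED_SITES:
        return False
    words = set(page_text.lower().split())
    return not (words & BLOCKED_KEYWORDS)

print(allow('http://casino.example/games', 'play now'))      # False
print(allow('http://news.example/today', 'weather report'))  # True
```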

2. Motivation and Applications

The increasing availability of inappropriate, dangerous, and illegal content on the Web has motivated the emergence of Internet filters and monitors as protection and enforcement tools. In this section, we discuss the main application scenarios for this kind of tool, along with the risks regarding privacy and information censorship.

2.1 Controlling Internet Abuse at the Workplace

Internet services are essential in modern corporations, with email as a dominant communication channel between workers and with providers and customers, and with the Web routinely used for market research and marketing, as a business-to-business and business-to-consumer platform, etc. But as the Internet contains much entertainment information (from the pornography and gambling industries to news, travel, etc.), it is also used by employees to waste time and resources on nonwork tasks. When ethically used, access to recreational Web sites can make employees more informed, happy, satisfied, and possibly more productive [58]. The words cyberslacking and cyberloafing are used to refer to Internet abuse, defined in Lim et al. [40] as 'any voluntary act of employees using their companies' Internet access during office hours to surf nonwork-related Web sites for nonwork purposes, and access (including receiving and sending) nonwork-related email' (p. 67). In Siau et al. [55], a number of Internet-related abuses are described, including:

• Copyright infringement, plagiarism. Using illegal or pirated software that costs organizations millions of dollars because of copyright infringements. Copying of Web sites and copyrighted logos.
• Transmission of confidential data. Using the Internet to display or transmit trade secrets.
• Pornography. Accessing sexually explicit sites from the workplace, as well as the display, distribution, and surfing of these offensive sites.
• Nonwork-related download or upload. Propagation of software that ties up office bandwidth. Programs such as Emule and BitTorrent allow the transmission of movies, music, and graphical materials.
• Leisure use of the Internet. Loafing around the Internet, which includes shopping, sending e-cards and personal email, gambling online, chatting, game playing, auctioning, stock trading, and doing other personal activities.
• Usage of external Internet service providers. Using an external Internet service provider (ISP) to connect to the Internet to avoid detection.
• Moonlighting. Using office resources such as networks and computers to organize and conduct personal business (side jobs).

Of course, not all these kinds of abuse are related to Web content and, more importantly, not all employees are guilty of Internet abuse. In Websense, Inc. [62], the results of a survey of Internet activities are presented. The answers of 286 workers to the question 'Do you ever access each of the following types of Web sites from work?' are presented in Fig. 1. As can be seen, a high proportion of employees admit using map, news, and weather sites, which are most often nonwork sites. In fact, among those employees who access nonwork-related Web sites, the average time accessing the Internet at work is 12.81 h, and the average time accessing nonwork-related Web sites at work is 3.06 h. Remarkably, the study was performed before the explosion of social networks like MySpace or Facebook, which have attracted many users in the last two years.

FIG. 1. Percentage of users and types of Web sites accessed at the workplace: work-related sites (93%), map sites (83%), news (80%), weather (76%), government (69%), educational (63%), banking (57%), travel (56%), personal email (49%), shopping (48%), auction (34%), real estate (32%), sports (30%), investment/stock (29%), job search (26%), free software (21%), video (17%), blogs (11%), music download (10%), other (9%), online communities (8%), games (6%), P2P (4%), hacking tools (3%), dating (3%), keyloggers (2%), gambling (1%), lingerie/adult content (1%), none (1%), hate (0%).

A number of the types of content accessed may be legitimate under several circumstances:

• The existence of an acceptable Internet usage policy that states the times at which employees can make personal use of Internet access (e.g., at lunch time).
• The work needs of the employee involve accessing some of these types of content, like online banking in a Financial Department, job search in a Human Resources Department, or travel sites for a secretary in charge of travel planning.

Moreover, letting workers access some types of content (online banking or shopping) may make them more productive, as some regulations allow employees to ask for time off for personal administrative tasks (e.g., once a week or month); workers may not ask for this personal time if they can do their administrative tasks online at the workplace. On the other hand, workers' access to inappropriate Web content can have an important impact on the corporation [63]:

• Productivity loss. The time invested in personal use of the Web can dramatically decrease employees' productivity. Several studies report different figures for the economic consequences, but the actual cost depends on the size of the company, the salary of the workers, and the extent of the abuse. We are not aware of truly reliable statistics, but as an example, a corporation with 100 employees wasting an average of 10% of their time on personal use of the Web, with an average salary of $30,000 a year, may be wasting $300,000 a year for this reason.
• Bandwidth waste. As bandwidth is a limited resource, legitimate Internet access for some workers can be very slow, or simply impossible, if other less ethical users waste it on downloading big files like movies, music, or software programs; connecting to audio and video streaming Web sites; playing online games or visiting virtual worlds; visiting rich media (e.g., map) sites; etc. In short, think of Internet access as a single telephone set in an office: if one worker is permanently making personal phone calls, the other workers cannot use it to communicate with customers.
• Legal liability. For a number of illegal infringements, a corporation may be liable on behalf of the worker who actually committed the offence. Depending on the country's regulation, this may include crimes that are a consequence of Web downloads, including software piracy, possession and/or distribution of pedophilic material, possession and/or distribution of offensive material (ranging from sexual harassment to Nazi or sect publicity), copyright infringement, and several others. The Web can also be used for illegal activities like hacking, defamation, fraud, etc., that may be carried out by an employee at the workplace.
• Security breaches. On one side, corporate workstations can be infected by malware programs distributed through dangerous Web sites, and infected workstations can become 'zombies' used to send spam, host illegal Web sites, launch distributed denial-of-service attacks, etc. On the other side, unethical workers can reveal sensitive information or corporate secrets through the Web.

Corporations have to address Web access control in order to make Internet use productive and avoid these risks. While corporations legally have the right to run Web filtering and monitoring software, several studies demonstrate that these programs should be used to enforce agreed acceptable Internet usage policies. In general, Internet abuse, and even addiction, must be approached as a Human Resources issue [21]:

• Educating managers and employees on the signs of Internet abuse.
• Creating better policies regarding what employers expect from employees' use of the Internet at work.
• Offering resources to employees who get caught in the Web.
• In extreme cases, taking disciplinary actions.

An Internet usage policy defines appropriate behavior when using company Internet resources and outlines the ramifications of violations [56]. In particular, the employer should make sure to cover all the abuses sketched above: copyright infringement and plagiarism, transmission of confidential data, pornography, nonwork-related download or upload, leisure use of the Internet, usage of external ISPs, moonlighting, etc. There are some guidelines for defining a good Internet usage policy [55]:

• State the company's values. These values may include profit making, professionalism, and cost-saving endeavors.
• The policy should complement the code for ethical computer use, and other codes and policies of the company.
• Make it clear that the company's system should be used only for business purposes.
• Emphasize that the company reserves the right to monitor all forms of Internet and email use, and list all types of monitoring carried out.
• Stress that transmission, display, or storage of sexually explicit, defamatory, or offensive materials is strictly prohibited at all times.
• Enforce the policy in a consistent and uniform manner, and assure that disciplinary action will follow a violation of the policy.
• Involve employees in the acceptable Internet usage policy (AIUP) development process, and ensure that employees understand and agree with the policy.


As the Internet and the behavior of workers evolve, the Internet usage policy must be suitably managed; Simmers [56] recommends:

• Periodic (weekly, monthly, or bimonthly) generation of Internet usage reports to allow feedback on policy compliance.
• Discussion of these reports at appropriate levels of the organization.
• Actions taken against those who violate the policy, per the action steps established in the policy.
• Addition of Web sites identified in usage reports as inappropriate to the filtering feature of the monitoring tool.
• Periodic review and update of the policy.

However, although the necessity of acceptable usage policies has been recognized by more and more institutions, these are still quite inconsistent. In a study performed by Palo Alto Networks on 20 large institutions covering around 350,000 users, it was found that existing policies ranged from completely absent, to several-year-old policies, to a fairly detailed policy that outlined specific applications and use cases [41]. Most often, the existing policies could not cover the ever-increasing range of Web-related applications used by employees, a number of which were present in a majority of the institutions, including circumventors (proxies and anonymizers), Web-based file sharing applications, instant messaging and Web mail, and many others. Web filtering and monitoring tools play a key role in the enforcement of these policies, as their filtering and reporting abilities may be used to limit Internet abuse. These tools can also be used to prevent the usage of emerging applications, which can be the source of more abuse. Moreover, the overwhelming majority of employees (92%) believe that their company has the right to install Web-filtering technology [62].

2.2 Children Protection

The Internet and the Web have quickly become extremely useful tools for children and youngsters, either as a source of information and recreation, or as a communication tool that links them to their friends and family. Moreover, kids are especially active on the Internet, as many of them were born after the emergence of the Web and are more technology-friendly than many adults. However, children are a very sensitive Internet user group, since they are in the phase of shaping their minds, and they must be especially protected. Children face a number of risks on the Internet, including [31]:

• Being exposed to inappropriate or even illegal contents, like Web pages promoting food disorders (anorexia and bulimia), pornography, drugs promotion, bomb making and terrorism, hate speech and racism, sects, hacking, gambling sites, etc.
• Contact and abuse by online sexual predators, who take advantage of anonymity to get in touch with children, seduce them, and even physically engage them in sexual acts.
• Cyber-bullying and loss of reputation, which includes sending hateful messages or even death threats to children, spreading lies about them online, making nasty comments on their social-networking profiles, or creating a Web site to bash their looks or reputation.
• Illegal activities, ranging from being victims of online fraud to actively taking part in activities like hacking, sharing illegal content, and others.
• Online addiction, especially to online gaming and sometimes to very damaging activities like gambling.

For instance, children and teens often employ their Internet time to keep in touch with friends and other students by using instant messaging and social networks. In a survey conducted in 2004 for the Pew Internet and American Life Project [43], based on 1100 teens aged between 12 and 17, 75% of the children that go online (971) send or receive instant messages (736), and most of them (60%) do it daily or from three to five times a week. Moreover, 56% of the teens that use instant messaging have created a public profile, being exposed not only to people they already know but to strangers. In another survey conducted for the same organization in 2006 [37], some 32% of online teenagers (and 43% of social-networking teens) report having been contacted online by complete strangers, and 17% of online teens (31% of social-networking teens) have 'friends' on their social network profile whom they have never personally met. The necessity of protecting children has been recognized by governments and social institutions, which have enacted a number of regulations aiming at this goal. Representative regulations are:

• The USA Children's Internet Protection Act [60], which requires schools and libraries that receive federal funds for discounted telecommunications, Internet access, or internal connections services to adopt an Internet safety policy and employ technological protections that block or filter certain visual depictions deemed obscene, pornographic, or harmful to minors.

• The European Union Convention on Cybercrime¹ [11], which precisely defines child pornography-related criminal offences, and which defines a Europe-wide framework regarding this topic. This framework has been extended to racist and xenophobic crimes by the Additional Protocol to the Convention on Cybercrime, Concerning the Criminalization of Acts of a Racist and Xenophobic Nature Committed through Computer Systems [12].

International cooperation with local action is required for protecting children against most of the risks they face online; but apart from general frameworks like the Convention on Cybercrime, the possibilities are very limited, due to the transnational nature of the Internet and the existence of criminal paradises in countries with soft or nonexistent regulations. Most often, individual countries like France and Australia have issued particular laws that address children protection, even requiring ISPs to provide parental controls and Internet filters to their customers. Protection of children on the Internet is not only a government issue, but mostly a parental issue. Most parents are concerned about the online risks and actively face them. For instance, in a survey conducted by the Kaiser Family Foundation [47] on 1008 parents of children ages 2–17, two-thirds say they are very concerned about the amount of inappropriate media² content children in this country are exposed to, and many believe media is a major contributor to young people's violent or sexual behaviors. In particular, nearly three out of four parents (73%) say they know 'a lot' about what their kids are doing online (among all parents with children nine or older who use the Internet at home). Most parents say they check their children's instant messaging 'buddy lists,' look to see what Web sites they have been to after they go online, and review what their children have posted online. In sum, they seem to be taking advantage of the tools available to them to monitor what their children are doing online. Of course, children protection is not only a technology issue, and includes not only filters but also hotlines, education, good practices, family contracts, and usage policies [9]; still, filters have proven to be a major tool for complementing other approaches.

1 Frameworks of this kind are to be signed by participating countries, which further have to ratify them locally and put them into practice. Unfortunately, major countries most often do not ratify them.
2 In this study, ‘‘media’’ refers to TV, music, movies, gaming, and the Internet.


2.3 Internet Filtering and Free Speech

The increasing utilization of Web filters raises important concerns regarding free speech and censorship on the Internet. Nearly since their very beginning, the Internet and the Web have been open networks, but the publication of content was limited to technically skilled people. With the emergence of blogging and social networks, nearly everybody can have a Web presence. So the Web is essentially democratic nowadays, and most people can find in it a vehicle for expressing their opinions and concerns; in short, a vehicle for the exercise of the First Amendment to the United States Constitution. The democratic nature of the Internet has been protected by traditional organizations like the American Civil Liberties Union, more specific institutions like the Electronic Frontier Foundation, and projects like the OpenNet Initiative (ONI). For instance, ONI’s mission is ‘‘to identify and document Internet filtering and surveillance, and to promote and inform wider public dialogs about such practices.’’ In particular, the ONI has edited a book entitled ‘‘Access Denied: The Practice and Policy of Global Internet Filtering’’ [15] that covers political, social, and technical issues regarding Internet filtering, and presents a report on a number of countries that make use of filtering technologies to limit their citizens’ Internet access. For example, the authors of the book have found evidence that commercial Internet filters are being used in a number of countries: ‘‘Saudi Arabia uses SmartFilter as a filtering proxy and displays a block page to users when they try to access a site on the country’s block list. (. . .). United Arab Emirates, Oman, Sudan, and Tunisia also use SmartFilter’’ (Chapter 1). SmartFilter is an Internet filtering and monitoring tool by the United States-based corporation Secure Computing, a leading vendor in the United States educational market.

Filtering and free speech is also an important concern in democratic countries. For instance, the Child Online Protection Act (COPA) [8] is a law in the United States of America, passed in 1998 with the declared purpose of restricting access by minors to any material defined as harmful to such minors on the Internet. The definition of harmful to minors in this regulation is ‘‘any communication (. . .) that is obscene or that a. the average person, applying contemporary community standards, would find, taking the material as a whole and with respect to minors, is designed to appeal to, or is designed to pander to, the prurient interest; b. depicts, describes, or represents, in a manner patently offensive with respect to minors, an actual or simulated sexual act or sexual contact, an actual or simulated normal or perverted sexual act, or a lewd exhibition of the genitals or post-pubescent female breast; and c. taken as a whole, lacks serious literary, artistic, political, or scientific value for minors.’’


Under this regulation, several cases have been opened against a number of Webmasters and corporations. In most of them, the defendants have argued that COPA is unconstitutional, and some of them have been acquitted. Recently, after a succession of appeals, a court decision [2] has stated that ‘‘the Child Online Protection Act (. . .) facially violates the First and Fifth Amendments of the Constitution (. . .) (1) COPA is not narrowly tailored to advance the Government’s compelling interest in protecting children from harmful material on the World Wide Web (‘Web’); (2) there are less restrictive, equally effective alternatives to COPA; and (3) COPA is impermissibly overbroad and vague.’’ In conclusion, the federal courts have ruled that the law violates the constitutional protection of free speech, and have therefore blocked it from taking effect.

There is a very difficult equilibrium between freedom of speech and other freedoms and protections. For instance, the conclusions of the (European) Expert Seminar on Combating Racism While Respecting Freedom of Expression [17] state that ‘‘freedom of expression and freedom from racism and racial discrimination are not conflicting, but complementary rights. We should keep in mind that human rights are interdependent and interconnected. This means that (i) there can be no such thing as two conflicting human rights and that, (ii) human rights need to be interpreted in light of each other.’’ Considering all these opinions and facts, our main conclusions are:

- Filtering at corporations is outside the freedom of speech versus censorship debate: the Internet connection is provided as a work tool, under the policies of a private company. Still, privacy is a concern.
- There is much evidence that filters are being used as censorship tools in undemocratic countries.
- Any regulation regarding child protection must balance its enforcement with free speech, and must not rely only on technical measures, but also on others as proposed by the COPA Commission.

In other words, filters are policy enforcement tools. If the policies are wrong, their usage can lead to censorship. If the policies are correct, they can be very useful.

3. Web Filters Operation and Techniques

As Web filters can be used in many possible scenarios, depending on the target users, the institution's organization, networks and carriers, etc., there are a number of operational issues to consider, which we address in this section. We also review the main techniques used in currently available filtering tools.


3.1 Operational Issues

The main operational issues we discuss are the dilemma between filtering and monitoring, available filtering categories, profiles and personalization, and network deployment.

3.1.1 Filtering Versus Monitoring

A fundamental dilemma that organizations and parents have to address is whether to filter inappropriate contents or just monitor Internet access. Both approaches can also be combined, as certain types of contents may be blocked while others are monitored. The difference between filtering and monitoring is the following:

- Filtering involves detecting a Web request, taking a decision about the suitability of the requested content (according to the defined policies), and either sending the user the desired content or blocking it by resetting the connection or sending an alternative content (a stop page).
- Monitoring consists of storing Web requests while always serving the desired content. The stored activity logs can later be analyzed to detect unacceptable patterns of behavior.

Both approaches can be combined to enforce policy compliance. For instance, a corporation may decide to block access to peer-to-peer (P2P) networks,3 as they are bandwidth-consuming applications rarely related to work, while just monitoring the rest of Web access. In another example, a corporation may decide to filter out job search engines for all employees except Human Resources workers, who may instead be monitored to avoid personal use of these engines. Filtering is an intrusive and disruptive approach, as users often perceive that the content is blocked (although the stop page may simulate a network error). Also, the filtering tool may incur false positives, that is, appropriate contents classified as inappropriate and blocked in consequence. For instance, many Web filters tend to classify sexual education sites as pornographic, and this may be an important concern for schools and libraries, or even a disaster for a health-related corporation.

3 Most P2P networks are usually blocked at the firewall level (discarding connections to specific ports). However, there is an increasing number of P2P and other applications that send their traffic through port 80 (reserved for Web traffic); these can be blocked by protocol analysis and detection.
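As an illustration of how filtering and monitoring can be combined under a single policy, the following minimal sketch (in Python; the category names are invented, and a categorizer is assumed to run upstream) blocks some categories and logs the rest:

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="web_access.log", level=logging.INFO)

# Per-category actions; "default" applies to anything not listed explicitly.
POLICY = {
    "p2p": "block",        # bandwidth-consuming, rarely work-related
    "job_search": "block",
    "default": "monitor",  # served, but logged for later analysis
}

def handle_request(user: str, url: str, category: str) -> str:
    action = POLICY.get(category, POLICY["default"])
    if action == "block":
        return "STOP_PAGE"  # or reset the connection
    logging.info("%s %s %s %s",
                 datetime.now(timezone.utc).isoformat(), user, url, category)
    return "SERVE_CONTENT"
```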


FIG. 2. An example of a stop page that allows requesting a review of the blocked content.

So, filters often send a block page that allows sending a request for review of the blocked Web page, either to the filter manufacturer or to the system administrator. Figure 2 shows a typical stop page with this functionality. Obviously, false negatives (inappropriate Web pages that are not detected by the filter) are rarely reported by the end users, but supervisors and administrators can periodically screen logs and reports to detect these abuses, and ask for a review if needed. Monitors are less intrusive than filters, but they must be supported by very powerful analysis tools that allow detecting inappropriate behavior. Most often, these systems are used as investigation tools to collect exhibits when other evidence exists (such as complaints from other users, e.g., the colleagues of a worker who disturbs them by abusing online pornography). Monitors are dual to filters, and many security tools can perform both tasks at the same time, simply by configuration.

3.1.2 Filtering Categories and Personalization

An important question is: What is inappropriate content? The kinds of inappropriate content may vary from organization to organization, and even from user to user in the same organization. For instance, blocking job search engines may be very sensible in a corporation, but not appropriate for a school. Or, some users may be allowed to see certain contents (like teachers reviewing pro-Nazi Web pages), while others should get the same requests blocked (like the students).


Personalization requirements in Web-filtering tools are increasing. Institutions specify Internet usage policies that define not only the contents to be blocked or monitored, but also the user profiles and their privileges. These user profiles can depend, for example, on the position or the department in corporations. To define these profiles and enforce the appropriate policies, there are at least two requirements:

1. Filter policies and profiles must be easily deployed in multisite institutions. Many organizations are geographically distributed, like international corporations or networks of schools or libraries. In those cases, the departments or types of users may also be distributed, but the policies ruling their privileges should be easily managed by system administrators and policy makers. In an optimal situation, multiple instances of a filter (one per site or station) would be collectively managed in a centralized fashion, from just one administration post. This does not imply that the policies and profiles are centralized, as current technologies allow spreading configuration changes across distributed systems.

2. Profile definition must be very flexible in terms of the contents to be supervised. This problem is usually addressed by defining a wide number of categories that cover many types of content. For instance, the Optenet Web filter currently includes more than 50 categories, ranging from pornography, violence, or sects to financial institutions, job search and directory, and street maps. This wide range of categories makes it possible to deliver sophisticated profiles that can meet the needs of schools, libraries, the government, or corporations. Moreover, filters even allow the definition of new categories by the system administrator, usually as lists of URLs pointing to user-defined sites. In Fig. 3, we show a typical category administration interface that allows the administrator to define new categories, to test URLs against current categories, or even to synchronize the categories with the provider's. Current Internet-filtering tools often address these topics and provide effective approaches to deal with them.
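To make the profile idea concrete, the following sketch encodes per-profile actions over named categories; the profile names, category names, and the administrator-defined category are purely illustrative, as real products expose far richer policy languages:

```python
# Illustrative per-profile policies; all names are hypothetical.
PROFILES = {
    "default": {"block": {"pornography", "violence", "job_search"}},
    "human_resources": {"block": {"pornography", "violence"},
                        "monitor": {"job_search"}},
    "teachers": {"block": {"pornography"},
                 "monitor": {"hate_speech"}},  # e.g., reviewing pro-Nazi pages
}

# An administrator-defined category as a list of user-defined URLs.
CUSTOM_CATEGORIES = {"locally_banned": {"http://example.org/forbidden"}}

def action_for(profile: str, category: str) -> str:
    rules = PROFILES.get(profile, PROFILES["default"])
    if category in rules.get("block", set()):
        return "block"
    if category in rules.get("monitor", set()):
        return "monitor"
    return "allow"
```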

3.1.3 Network Deployment

Filtering systems must currently face a number of challenges: organization locations can be distributed, filtering services can be provided at the carrier, users can demand local filters at their stations, etc. In consequence, Web-filtering tools can be deployed in a wide variety of network scenarios, which dramatically affects their customization, performance, and security requirements. In Fig. 4, we present a number of scenarios and network points at which Internet filters can be deployed. These points are tagged with numbers, which correspond to general network locations. We discuss these scenarios and their properties in the next paragraphs.


FIG. 3. An example of a category administration interface.

FIG. 4. Network points of filter deployment: (1) user workstations, (2) the organization network, (3) the carrier or Internet service provider, and (4) a third-party provider SOC.


3.1.3.1 Filtering at the Workstation. Filters and monitors can be deployed at final user workstations, often as part of a full security suite including antivirus, firewall, etc. Some vendors that offer these suites are Symantec, Trend Micro, and Optenet, among others. Another flavor is safe browsers, tools that include filtering as their main functionality, like Nippy4 or KidSplorer.5 Moreover, nearly all traditional Web browsers supply filtering functionalities among their security options. These solutions are typical choices for home users and small and medium enterprises. This scenario is marked as 1 in Fig. 4, and it is often referred to as an ‘‘endpoint’’ solution. This kind of deployment can hardly accommodate site-wide policies applying to all the computers, and requires individual configuration of every station. However, full customization of each workstation can be achieved. On the other side, performance requirements in terms of efficiency (processing time) are much lower than in other options. Perhaps the main weakness of this approach is its security: as filtering is performed in the local workstation, technology-savvy employees and kids often find ways to hack the system and access the blocked content. They can even find the hack on the Web itself.

3.1.3.2 Filtering at the Institution Network. The filtering system can be deployed at the institution network. The system can be installed on a customer (usually dedicated) server, or may be provided as an appliance. Modes of operation include bridging (the filtering server is put between the access point and the rest of the network), routing, and proxying (the server is put at the same level as other workstations, but acts as a router or as a proxy server), among others. Vendors of appliances and software packages for network-level filtering include Optenet,6 WebSense,7 or IronPort,8 and there are open-source packages like DansGuardian9 or POESIA.10 This is a suitable choice for distributed organizations, which most often are big corporations or federations of schools. This operation point is marked as 2 in Fig. 4.

4 http://www.mynippy.net/.
5 http://www.devicode.com/kidsplorer/.
6 http://www.optenet.com/.
7 http://www.websense.com/.
8 http://www.ironport.com/.
9 http://dansguardian.org/.
10 http://www.poesia-filter.org/.


Regarding customization, many filter vendors nowadays provide coordination mechanisms among servers installed in distributed locations. A number of providers offer unified administration consoles that allow the administrator to specify profiles and policies that apply to the whole ‘‘virtual’’ corporate network. This is the reason why we draw a discontinuous line between two organization networks, as they may be located at different offices, possibly linked to different ISPs. Performance requirements for these filtering servers are stronger, as the system must be able to monitor the traffic of hundreds or thousands of concurrent users. In consequence, this mode of deployment requires dedicated high-performance servers or appliances (typically with special network hardware). The filtering system is usually much less vulnerable in this scenario. Probably the most dangerous attacks are physical (disconnecting the machine that hosts the filter) or social engineering (getting the administration password by fooling the administrator).

3.1.3.3 Filtering at the ISP.

ISPs and carriers are always improving the commercial services they offer to their customers. In particular, many current ISPs offer security services to their clients, including parental controls for home users and full security services (firewalling, antivirus and antispam, Web filtering). In this case, the filters and other software products are installed on servers in the carrier operation centers, possibly as appliances. Quite frequently, the service is provided by the carrier under its own brand, with the filter vendor acting as a private (white-label) supplier. Vendors of carrier-level filtering technology include Optenet and Fortinet.11 This scenario is suitable for all kinds of institutions, and even for home users. The service is most often billed by subscription. This scenario is marked as 3 in Fig. 4. When services are provided in this way, they are typically regarded as ‘‘software as a service’’ (SaaS) [18]. No equipment or software installation is needed at the customer premises. It is often believed that SaaS does not allow much configuration by the user, but this belief is incorrect. For instance, a combination of a firewall and a Web filter can be managed remotely by the end-user administrator to implement the full suite of policies defined by the corporation or the child's tutor. The possibilities of configuration only depend on the quality of the filtering solution and the kind of service that the ISP wants to offer.12

11 http://www.fortinet.com/.
12 For instance, the carrier can offer a ‘‘silver’’ low-cost service that allows fewer configuration options than a ‘‘gold,’’ more expensive one, which enables the administrator to configure port blocking, categories, profiles, policies, etc.


Regarding performance, this is no doubt a very hard and challenging scenario. The servers that support the service have to deal with up to millions of concurrent users, with nearly no delay. These filters are extremely optimized in terms of processing time; they are typically deployed in server farms and have grid-like abilities including high scalability, redundancy, etc. From the point of view of security, these filters are far stronger than the previous ones. Carrier physical and software security measures are extreme, as the whole of their service depends on them.

3.1.3.4 Filtering as a Third-Party Service. An alternative deployment scenario is that in which the service is provided at a third-party network, commonly regarded as a service ‘‘in the cloud.’’ In this case, a vendor distributes a network of (security) operation centers (SOCs), possibly across the world, and the customers send their Web traffic through these SOCs (for instance, by proxying). This operation mode can also be considered SaaS, as the customer gets the service without hardware or software licensing. Examples of vendors are ScanSafe13 and WebSense. This scenario is suitable for all kinds of organizations, including home users, but is most often targeted at small and medium enterprises. Filtering as a third-party service is marked as 4 in Fig. 4. The discontinuous lines represent the fact that filtering can be performed at any level of an organization. All concerns about configuration, performance, and security are the same as for carrier-level filtering, except perhaps for the fact that these services are weaker against distributed denial of service attacks.

3.2 Filtering Techniques

In this section, we describe the main techniques used in filtering and monitoring tools. We explicitly exclude port blocking, because it is usually implemented as a firewall-level service.

3.2.1 Self-Regulation

Self-regulation consists of good practices that are implemented by content providers, and generally involves:

13 http://www.scansafe.com/.


1. A self-labeling system and policy used to describe the content in terms of its explicit nature, suitability for children, etc., which is used by the content provider to tag their content.

2. A filter on the client side that recognizes content labels and matches them against the filter user's own policies, delivering or blocking the content as required.

Popular labeling systems include PICS and ICRA. PICS [44] is a set of specifications created by the World Wide Web Consortium (W3C) to define a platform for the creation of content rating systems. It enables Web publishers to associate labels or metadata with Web pages to keep certain Web content of an explicit nature targeted at adult audiences from reaching other groups of Internet users. ICRA (formerly the Internet Content Rating Association) is part of the Family Online Safety Institute, an international, nonprofit organization working to develop a safer Internet. The centerpiece of the organization is the descriptive vocabulary, often referred to as ‘‘the ICRA questionnaire.’’ Content providers check which of the elements in the questionnaire are present or absent from their Web sites. This then generates a small file containing the labels, which is then linked to the content on one or more domains. The broad topics covered by the ICRA vocabulary are:

- The presence or absence of nudity
- The presence or absence of sexual content
- The depiction of violence
- The language used
- The presence or absence of user-generated content and whether this is moderated
- The depiction of other potentially harmful content such as gambling, drugs, and alcohol

Most popular browsers include security options regarding ICRA. For instance, Microsoft Internet Explorer includes a (password-protected) Content Advisor in which it is possible to define what kind of content can be displayed. In Fig. 5, we show the Content Advisor with the category ‘‘Nudity & Sexual Material—Context Variable—Arts’’ selected; the bottom slider is used to define the privilege level. Also, the ICRA itself has developed an endpoint Web filter named ICRAFilter that makes use of this labeling system. However, the adoption of PICS and ICRA labels is not regulated, and it is possible for some publishers to mislabel their Web content either by intent or by mistake. The existence of a third party reviewing the labeled contents is unfeasible. PICS and ICRA should therefore only be used as a supplementary tool in any Web-filtering system, as is the case in many commercial and open-source systems.


FIG. 5. Microsoft Internet Explorer Content Advisor with ICRA labels.

3.2.2 Listings

This technique restricts or allows access by comparing the requested Web page's URL (and the equivalent IP address) with URLs in a stored list. Two types of lists can be maintained: a black list contains URLs of objectionable Web sites to block, while a white list contains URLs of permissible Web sites. Most Web-filtering systems that employ URL blocking use black lists.


This approach's chief advantages are speed and efficiency. A system can make a filtering decision by matching the requested Web page's URL against the list even before a network connection to the remote Web server is made. However, this approach requires implementing a URL list, and it can identify only the sites on the list. Also, unless the list is updated constantly, the system's accuracy will decrease over time owing to the explosive growth of new Web sites. Most Web-filtering systems that use URL blocking employ teams of human reviewers who actively search for objectionable Web sites to add to the black list. They then make this list available for download as an update to the list's local copy. This is both time consuming and resource intensive. However:

- Filter vendors have defined internal protocols and deployed suitable tools that make the list updating process fast and effective. For instance, some filter providers make use of Web spiders (bots that recursively download Web pages by following their links) that automatically tag pages heavily connected to other objectionable Web sites using the tags of those sites; afterwards, the pages are sent for review to human experts, who can correct the tags in case of mistakes but most often just have to validate the automatic classification.
- Efficacy depends not only on the list, but also on Web usage. As in many other domains, the most popular Web sites accumulate a vast majority of visits, following a Zipf's law distribution [6]. It is possible to achieve high performance with relatively small lists by closely studying users' behavior and focusing on the most popular URLs.

Another advantage of this approach is that it combines well with content analysis to keep the list up to date while preserving fast operation. Using sophisticated content analysis techniques during classification, the system can first identify the nature of a Web page's content. If the system determines that the content is objectionable, it can add the page's URL to the black list. Later, if a user tries to access the Web page, the system can immediately make a filtering decision by matching the URL. Dynamically updating the black list achieves speed and efficiency, and accuracy is maintained provided that the content analysis is accurate. Because of these advantages, nearly all commercial and open-source Web filters use this technology as their primary filtering technique. Current commercial URL lists include from 3 to 15 million items.
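A minimal sketch of this black-list-plus-dynamic-update scheme follows (in Python; `classify_content` is a hypothetical stand-in for the intelligent content analysis of Section 3.2.4, and the host matching is deliberately simplified):

```python
from urllib.parse import urlsplit

black_list = {"badsite.example", "porn.example"}  # vendor-supplied host list

def classify_content(url: str) -> str:
    return "safe"  # placeholder for intelligent content analysis

def is_listed(url: str) -> bool:
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # match the host and every parent domain against the list
    return any(".".join(parts[i:]) in black_list for i in range(len(parts)))

def filter_request(url: str) -> bool:
    host = urlsplit(url).hostname or ""
    if is_listed(url):                            # fast path: list lookup
        return True
    if classify_content(url) == "objectionable":  # slow path: analysis
        black_list.add(host)                      # dynamic list update
        return True
    return False
```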

3.2.3 Keyword Matching

The most primitive form of content analysis is keyword matching. This intuitively simple approach blocks access to Web sites on the basis of the occurrence of offensive words and phrases on those sites. It compares every word or phrase on a


retrieved Web page against those in a keyword dictionary of prohibited words and phrases. Blocking occurs if the number of matches reaches a predefined threshold. This fast content analysis method can quickly determine whether a Web page contains potentially harmful material. However, it is well known for overblocking, that is, blocking many Web sites that do not contain objectionable content. Because it filters content by matching keywords (or phrases) such as ‘‘sex’’ and ‘‘breast,’’ it could accidentally block Web sites about sexual harassment or breast cancer, or even the home page of someone named Sexton. Although the dictionary of objectionable words and phrases does not require frequent updates, the high overblocking rate greatly jeopardizes a Web-filtering system's capability and is often unacceptable. However, a Web-filtering system can use this approach to decide whether to further process a Web page using a more precise content analysis method, which usually requires additional processing time.
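The following sketch illustrates threshold-based keyword matching; the dictionary, weights, and threshold are invented for illustration, and even exact word matching still overblocks, since legitimate pages (say, about breast cancer) contain the matched words:

```python
import re

KEYWORDS = {"sex": 1.0, "porn": 2.0, "xxx": 2.0}  # word -> weight
THRESHOLD = 4.0

def keyword_score(page_text: str) -> float:
    # tokenize into lowercase words and sum the weights of matches
    words = re.findall(r"[a-z]+", page_text.lower())
    return sum(KEYWORDS.get(w, 0.0) for w in words)

def should_block(page_text: str) -> bool:
    return keyword_score(page_text) >= THRESHOLD
```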

3.2.4 Intelligent Content Analysis

Intelligent content analysis is an attempt at achieving semantic understanding of Web contents. In particular, intelligent classification techniques can be used to categorize Web pages into different groups (e.g., pornographic and nonpornographic) according to the statistical occurrence of sets of features. This categorization is later used by the system to decide whether to deliver the content or not, according to the profiles and policies defined in terms of the available categories. The two most prominent content analysis technologies are text classification and image processing (discussed below), although there is some work on video processing (e.g., [33]). These techniques are always category dependent, that is, a specialized classifier must be built for each category, most often using machine learning (ML) approaches that learn the most informative features of Web pages in the category. Moreover, specific techniques used to detect certain types of content (e.g., pornography) may be ineffective with respect to other types of content (e.g., hate speech), as is the case with image processing: techniques used to detect skin areas are just not suited to the detection of Nazi symbols. The most important drawback of these techniques is their performance. Although it is possible to build quite efficient systems (most often by decreasing the effectiveness of the tool), the overall processing time makes them inappropriate for the most demanding situations (like filtering at the carrier). However, if the intelligent content filter is called only when the URL is not in the vendor database, the number of requests that fire this component may be very small, and its results can be cached or inserted into the URL listings. Many current commercial tools include text analysis techniques, but because of performance, image processing is restricted to endpoint solutions or to offline (e.g., forensic) systems.


Another drawback is that building intelligent analysis tools requires special knowledge and expertise,14 and that it is difficult to correct system mistakes, as the techniques are quite complex.

14 These components have to be developed by experts in data mining, text classification, and image processing.

4. Text-Based Filtering

Amongst the different techniques used for text-based Web content filtering, automated text categorization (TC) is currently the most widely used. The purpose of this task is to assign documents to a set of predefined categories (also called classes or topics) [53]. Although automatic text categorizers can be built by hand (e.g., by defining a set of heuristic rules), the complexity of Web content requires the automatic construction of these systems using an ML approach. This approach consists of training a text classifier using a set of manually labeled documents, and has proved to be as accurate as human experts. The complexity of Web content is defined in part by its structure, but a key point is the fact that authors are continuously adapting their contents to avoid filtering systems. From the point of view of machine learning, this can be considered an Adversarial Classification problem [13], and since contents are mainly textual, it is defined as Adversarial Text Classification.

4.1 Text Classification Tasks

The aim of Text Classification is to provide structure to an unstructured repository of text, thereby easing storage, search, and browsing [54]. This discipline belongs to the broad field of text mining (TM) [26] or, more precisely, to Knowledge Discovery in textual databases. The first approach used successfully to face the problem of TC was knowledge engineering (KE), in the 1980s. A knowledge engineer had to build an expert system that could automatically classify text, but their lack of knowledge of the domain required the intervention of a domain expert. Moreover, the system had to be maintained by hand over time, making it a high-cost process in terms of human work. From the 1990s, the KE approach was replaced by the use of statistical techniques, making TC a suitable problem for the field of statistical natural language processing (NLP). In this approach, the classifier is built using a general inductive


process trained with a set of example documents. The main advantages of NLP over KE are:

- The high degree of automation, since the engineer develops an automatic builder of classifiers
- Reusability, because the automatic builder can be applied to the creation of many different classifiers for many different problems and domains just by changing the training set of documents
- Ease of maintenance, since changes to the system only require changes to the training set and a new training process
- High availability (current and future) of inductive learning algorithms
- Accuracy, since automatic classifiers usually outperform those built by human experts

The number of Text Classification tasks has increased over the years, and several ways of organizing these tasks can be found in the literature. According to Lewis [38], TC tasks can be classified along two axes: type of learning and granularity of text elements. The two types of learning are defined by the control over the training set:

1. Supervised learning. The set of classes is known when building the training set, and there are examples for each of the classes.
2. Unsupervised learning (clustering). The set of classes is unknown before training, and the goal is to group textual entities according to similar contents.

Three levels of granularity can be defined, considering terms, phrases, or documents as atomic elements:

1. Terms. Ranging from word stems and single words to short expressions
2. Phrases. Going from clauses to complex sentences
3. Documents. Including short spam emails, medium-sized papers, or even whole books

Table I shows some of the most representative TC tasks categorized according to these two axes. For instance, named entity recognition is the task of detecting proper names, temporal expressions, and quantities inside text documents [4]. It is a supervised task because all the possible entity types are known from the beginning (names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.), and the elements considered are terms composed of one or a few words. Text segmentation identifies a sequence of clauses or sentences that display local coherence within a document [34]. It is unsupervised because segments do not correspond to predefined classes, and the process is applied to phrases as text elements.


Table I. An organization of text classification tasks

            Supervised learning                        Unsupervised learning
Terms       Disambiguation, part-of-speech tagging,    New meaning discovery
            named entity recognition
Phrases     Partial chunking, automatic summarization  Text segmentation
Documents   Document retrieval, text categorization    Document clustering

Sometimes, these tasks are not the main goal of a TC system, and can act as a service for other tasks. For instance, part-of-speech (POS) tagging, named entity recognition, and disambiguation are often used as preliminary steps to text categorization to improve the quality of the attributes, assigning a suitable meaning to each term.

4.2 Text Categorization Types

Text categorization admits two different taxonomies, according to the number of categories defined and the degree of confidence in the decision taken. The first taxonomy differentiates between single-label and multilabel TC: single-label TC assigns only one category to a given document, while in multilabel TC a document may belong to zero, one, or more than one category. The latter is usually approached as the problem of deciding whether a document belongs to each of the categories individually. The second taxonomy distinguishes between hard and soft categorization. Hard categorization consists of deciding whether a document definitely belongs to a category or not. Soft categorization, on the other hand, involves giving a numeric score that indicates the classifier's degree of confidence that the document belongs to a category. Soft categorization is very useful for creating rankings of documents in terms of their proximity to a given category.
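As a small illustration of the relation between the two (the documents and scores below are invented), a soft classifier's per-document scores can be ranked directly, or thresholded into hard decisions:

```python
# Hypothetical soft-categorization scores for one category.
scores = {"page1.html": 0.92, "page2.html": 0.35, "page3.html": 0.71}

# Soft output supports ranking documents by proximity to the category.
ranking = sorted(scores, key=scores.get, reverse=True)

# A threshold turns soft scores into hard in/out decisions.
THRESHOLD = 0.5
hard = {doc: score >= THRESHOLD for doc, score in scores.items()}
```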

4.3 Text Classification Process

According to Sebastiani [54], we can describe the TC process as consisting of four main phases:

1. Document indexing. Documents must be mapped to a compact representation of their content that can be directly interpreted both by a classifier-building


algorithm and by the built classifier. The most widely used representation is a vector of attributes that occur in the training set, each one with a value corresponding to the weight that it may have for the document. The initial set of attributes usually comprises the whole set of words that appear within the whole document set, excluding a series of common words that are defined in what is called a stop list. In many cases these words are reduced to their stems (morphological roots). Weights are assigned using statistical heuristics that represent facts such as the number of times a term occurs in a document (term frequency), or the number of documents that contain the term (inverse document frequency). Within the last few years, some research works on document indexing have begun to use more complex attributes, either by grouping single words into word n-grams, parsing the text to obtain syntactic information, or by extracting concepts to represent the semantics of the text. However, they have not been shown to improve on the standard representation based on words.

2. Dimensionality reduction. The sizes of the vectors obtained after the first phase are generally in the order of tens of thousands or even hundreds of thousands, making efficiency of learners very hard to achieve. The second step involves reducing the length of these vectors to produce a new representation of documents. The most common dimensionality reduction techniques are grouped either into feature extraction methods, such as latent semantic indexing [39] or term clustering [38], or feature selection techniques, such as chi-square [64], information gain [38], or mutual information [16]. Feature extraction methods combine several dimensions into what will be a new single attribute in the reduced vector; feature selection techniques attempt to determine and select the best attributes from the original set, instead of generating new attributes.

3. Classifier learning. A general inductive process trained with a set of example documents automatically builds a text classifier. The representation for each document is that obtained after the second phase. Amongst the most popular supervised learners used for text categorization, we can cite probabilistic Bayesian models, Bayesian networks, decision trees, Boolean decision methods, neural networks, classifier ensembles, or support vector machines, but the list of techniques explored is longer [53]. These algorithms differ greatly in terms of the type of model created, efficacy and efficiency, and capacity to manage huge numbers of attributes. Currently, support vector machines (SVMs) [29] and boosting [51] stand out from the rest, since they have outperformed competitors in different benchmarks and challenges.

4. Evaluation. The most important aspect in the evaluation of classifiers is effectiveness, since it is very important to minimize the errors made by the system.


However, sometimes it is convenient to consider other measures related to efficiency, understandability of models, portability and scalability of techniques, etc.
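The four phases map naturally onto a standard ML toolkit. The following minimal sketch uses scikit-learn, our choice for illustration rather than something prescribed by the literature reviewed here; `load_labeled_pages` is a hypothetical corpus loader returning raw texts and binary labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

docs, labels = load_labeled_pages()  # hypothetical corpus loader

pipeline = Pipeline([
    ("indexing", TfidfVectorizer(stop_words="english")),  # phase 1: TF-IDF
    ("reduction", SelectKBest(chi2, k=1000)),             # phase 2: chi-square
    ("learning", LinearSVC()),                            # phase 3: SVM
])

X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.2)
pipeline.fit(X_train, y_train)
print("F1 =", f1_score(y_test, pipeline.predict(X_test)))  # phase 4: evaluation
```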

4.4 Web Content Filtering as Text Categorization

Web content filtering can be approached as an adversarial text classification problem, since most content in Web pages is raw text. A critical aspect that must be considered is that it is a multiclass problem: each type of content (pornography, racism, gambling, etc.) corresponds to a category, and the classifier system must be capable of detecting whether a Web page belongs to a given category, in order to allow system administrators to define different filtering profiles. The most common approach to this problem is to build an independent classifier for each of the categories. The four phases of the TC process described above must be adapted to the characteristics of this problem as follows:

1. Document indexing. Web pages have a rich HTML structure that most research works fail to exploit. However, some approaches make explicit use of it, such as the system described in Agarwal et al. [1], which considers seven specific sections of Web documents: URL, hyperlink tags, image tags, title, metadata, body, and tables. Words are separated according to the section, thus building vectors of vectors to represent Web pages. The vocabulary set is constructed using diverse approaches depending on both the page language and the domain (pornography, violence, etc.). For instance, Lee et al. [35] build a vocabulary set of 55 words by hand to classify a Web page as pornographic. In Guermazi et al. [24], several dictionaries in several languages are manually constructed, with words that are likely to indicate whether a Web page can be categorized within the violence domain. The most widely used representation is the vector of words, applying stop lists and stemming when the language supports it (usually Western languages). Examples can be found in [1, 7, 22, 32]. Some approaches use part-of-speech tags, such as noun, adjective, or verb [23, 59], as well as punctuation marks [23]. The use of word n-grams has also produced positive results [23, 42]. The weights used for attributes are of diverse nature, ranging from binary values [22, 59] to term frequencies [7, 23, 42] and the combination of TF and IDF [1, 7, 32].

2. Dimensionality reduction. Most works do not make use of dimensionality reduction, although there are some exceptions. For instance, in Gómez Hidalgo et al. [22], information gain is used to filter out those terms that do not seem to be important for pornography detection, and Chou et al. [7] compare three quality


measures for three different filtering domains in the workplace (news, shopping, and sports).

3. Classifier learning. A high number of ML methods have been tested for Web content filtering, but the lack of standard collections or competitive challenges does not allow solid conclusions to be reached about the quality of those algorithms. Amongst the most commonly used algorithms are several versions of Naïve Bayes [7, 22, 42], decision trees [7, 22, 24], lazy learners such as k-nearest neighbors [7, 59], neural networks [7, 35], and support vector machines [1, 7, 22, 23, 32, 42]. References [7] and [22] are the most exhaustive comparative studies, the former addressing abuse in the workplace and the latter pornography. In the first study, C4.5 produces the best results, followed by kNN and SVM; in the second study, SVM proves to be clearly the most effective. However, since both the domains and the data sets are different, it is impossible to establish a comparison between them and extract any kind of conclusion.

4. Evaluation. The works presented in this field are totally heterogeneous, which makes it impossible to compare evaluation results. All studies make use of private sets of Web pages in very different domains (pornography, racism, violence, abuse in the workplace, etc.), and in multiple languages (English, Spanish, Chinese, Italian, etc.). The sizes of the sets are variable, ranging from a few hundred to several thousand pages. Moreover, the evaluation metrics are diverse, including precision and recall, F-measure, accuracy and error, and the ROC convex hull (ROCCH) method.

5. Image Processing Techniques

Image processing has been a very active research field over the last 20 years, especially since the emergence and popularization of the Web and the increasing availability of image content on it. Among the domains to be filtered, the most popular one is by far pornography and naked people (see, e.g., [20, 25, 36, 52, 61]), and there is scarce work in other domains like hate speech (e.g., Nazi symbols [65]). Many existing techniques used for filtering adult content classify Web pages as porn or safe using their text content. However, those approaches have some limitations: they are very dependent on the language; they need pages containing enough text for a reliable classification, and they usually do not work with obfuscated texts [48]. Images are an essential part of the World Wide Web. They are used to make sites more attractive to the visitor, but in adult sites multimedia content can be the main element of the site. It is possible to find adult Web sites that are mainly composed of text, sites like blogs with content for adults, or Web sites with erotic stories, but most


adult Web sites usually have a significant amount of images, many of them with explicit content. There are also sites where the percentage of text is very low. Sites like Thumbnail Gallery Posts (TGPs) usually have one or two lines of text, and the rest of the Web page is composed of pictures. Those sites cannot be filtered with traditional text filters, so effective filtering of images is very desirable in a filtering solution. Unfortunately, there are properties of objectionable images that make the problem very difficult [61]:

- Most images contain a nonuniform background.
- The foreground may contain textual noise like phone numbers, URLs, etc.
- Content may range from grayscale to 24-bit color.
- Some images may have a very low resolution.
- Views can be taken from a high variety of angles.
- Images may contain many people.
- People can have different skin colors.
- Images may contain both people and animals.
- Images may contain only some parts of a person.
- People may be partially dressed.

We review some image processing techniques that address pornographic image detection in the next sections.

5.1 Adult Image Recognition Using Skin Detection

One of the first and most popular techniques for filtering images with naked people is the one proposed by Forsyth [20]. This approach looks for probable skin regions in the image and extracts groups and features from those regions. Those groups feed a geometric filter based on skeletal structure to identify human presence. In Fig. 6, we show an image before and after it has been processed to highlight skin areas. In Fig. 7, we present a diagram that shows the main processing steps in image analysis and classification. The first step in a system that identifies naked people in images is to identify areas of skin using the color histograms of the images. The color of human skin in a picture is created by a combination of three factors: blood, melanin, and light conditions. The first two elements involve the colors red, yellow, and brown, and for that reason skin tones lie between those hues. Light conditions cannot be controlled, but we can extract features that do not depend on that parameter.


FIG. 6. A nude picture before and after skin area detection.

5.1.1 Preprocessing Images

For extracting skin areas, 8-bit RGB images are needed. Since most images will be in JPEG format, the first step is to decode them and reduce the number of colors, converting them to 8-bit RGB. Other filters like scaling or noise reduction can also be applied in this step.

5.1.2 Skin Detection

Skin is usually detected in a two-step process. First, those pixels whose colors are very likely to be skin are selected. Then, the selection is expanded to include pixels whose color and texture are similar to the pixels selected in the previous step. The next step is to locate groups of connected skin pixels. When connected components of pixels that are probably skin are over a certain threshold, usually ranging


FIG. 7. Main stages in image processing and classification: RGB image, prefiltering, region extraction, features vector, classifier, and result class.

from 50% to 60%, they are extracted and grouped. The number of connected groups can be used as a feature for classification. These connected groups can also feed a geometric filter based on skeletal structure to identify human presence [20]. Other approaches to filtering objectionable images propose to complement the technique above with many other features for detecting pornographic images. Jones and Rehg [30] propose to use the following features for the classification process:

- Percentage of pixels detected as skin
- Average probability of the skin pixels
- Size in pixels of the largest connected component of skin
- Number of connected components of skin
- Percentage of colors with no entries in the skin and nonskin histograms

Rowley et al. [48] propose to include other skin-independent features like image attributes (size, shape, etc.), entropy features to distinguish adult-content


images from icons or banners, clutter features (amount of texture in skin regions), and face detection. This approach can classify 89% of the images with an average processing speed of 11 s per image [3].
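A minimal sketch of this family of techniques follows, using a commonly cited RGB heuristic for skin pixels (not the exact rule of [20] or [30]) and deriving some of the features listed above; it assumes an 8-bit RGB image as a NumPy array:

```python
import numpy as np
from scipy import ndimage

def skin_mask(img: np.ndarray) -> np.ndarray:
    """Boolean mask of probable skin pixels for an (h, w, 3) uint8 image."""
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    spread = img.max(axis=-1).astype(int) - img.min(axis=-1).astype(int)
    return ((r > 95) & (g > 40) & (b > 20) & (spread > 15) &
            (np.abs(r - g) > 15) & (r > g) & (r > b))

def skin_features(img: np.ndarray) -> dict:
    mask = skin_mask(img)
    labels, n_groups = ndimage.label(mask)       # connected skin components
    counts = np.bincount(labels.ravel())[1:]     # size of each component
    largest = int(counts.max()) if counts.size else 0
    return {"skin_fraction": float(mask.mean()),
            "n_components": int(n_groups),
            "largest_component": largest}
```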

5.2 Adult Image Recognition Using Wavelets

Although very good results can be achieved with the techniques described in the previous section, those systems have a performance problem that makes them unusable in real-world systems. The system proposed by Forsyth could take about 6 min to process an image on a workstation.15 Later refinements decreased the processing time, but it still took over 1 min per image. A totally different approach to filtering adult images is the one used in Wang et al. [61], which uses a combination of different filters, including an icon filter, a graph/photo detector, a color histogram filter, a texture filter, and a wavelet-based shape matching algorithm. That system is practical for a real-world implementation because it takes less than 2 s to process each image and achieves very good results. This system uses an algorithm that compares the semantic content of images containing human bodies. Using moment analysis, textures, histograms, and statistics, the algorithm produces a feature vector that provides high accuracy in the recognition of nude human bodies in a picture. For the wavelet analysis, the approach uses Daubechies' wavelets [14], which separate the image into clean, distinct low-frequency and high-frequency parts. Daubechies' wavelets are not as easy to implement as other, simpler wavelets (like Haar ones), but they are highly suitable for general-purpose images. In Fig. 8, we show an example of a 2D Daubechies wavelet.

FIG. 8. Daubechies 20 2D wavelet.

15 At the time of this writing, current workstations are thousands of times faster.
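As a sketch of how such a decomposition can feed a feature vector, the following uses PyWavelets (an assumed dependency; the subband statistics are illustrative and not the exact features of [61]):

```python
import numpy as np
import pywt

img = np.random.rand(256, 256)  # stand-in for a grayscale image

# One-level 2D Daubechies decomposition: approximation (low-frequency)
# and horizontal/vertical/diagonal detail (high-frequency) subbands.
cA, (cH, cV, cD) = pywt.dwt2(img, "db2")

# Simple subband statistics as candidate texture/shape features.
features = [float(np.mean(np.abs(c))) for c in (cA, cH, cV, cD)]
```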

6. Evaluation of Web Filters

The evaluation of Web filters is a prominent issue, given the constant need to improve them. As user navigation patterns change over time, and the number of users is always growing, the only way to keep filters effective is to routinely perform tests to check their actual performance.




We classify the evaluation of Web-filtering tools into two categories:

1. Industrial evaluations, quite often performed by analysts on demand of a commercial vendor, or by software magazines
2. Scientific evaluations, performed in the context of scientific works like those presented in previous sections covering intelligent content analysis

We discuss the procedures, advantages, and drawbacks of both types of evaluations in the next sections.

6.1 Industrial Evaluation

An industrial evaluation is a test performed by a magazine or a third-party laboratory, which reviews a number of products and determines the strengths and weaknesses of the tested tools. The main advantage of industrial evaluations is that they usually cover the full range of features of the tools, providing readers a trustworthy opinion that may be used to make decisions about which tool to purchase. Examples of such reviews are:

- In [49], a group of 12 endpoint Internet filters for kids is reviewed by PC Magazine analysts, focusing on the ability to block and monitor instant messaging, the existence of connection time controls, and the power of remote notification and management.


- SC Magazine periodically reviews enterprise-market Web content filters. In its 2007 review,16 the features covered are ease of use, performance, documentation, support, and value for money.

16 http://www.scmagazineus.com/Web-content-filtering-2007/GroupTest/10/.

Of course, the evaluation is highly targeted to certain types of customers and solutions. For instance, for an endpoint parental control solution, the following features should be covered:

- Filtering algorithms (object analysis, URL based, keyword based, and dynamic categorization)
- Filtering capabilities (filter categories, editable filter lists, chat filtering, chat monitoring, chat blocking, newsgroup blocking, IM port blocking, peer-to-peer blocking, FTP blocking, customizable port blocking, email filtering, email blocking, popup blocking, predator blocking, and personal information blocking)
- Reporting capabilities (remote reporting, notification alerts by email, log reports sent by email, summary history reporting, detailed history reporting, graphical reporting, and logging of security violations)
- Management capabilities (individual user profiles, password controls, remote management, and stealth options)
- Other functionality features (immediate overriding of blocks, warning/not just blocking, daily time limits, negligible surfing time impact, updated URL/filtering rules, and blocking sensitivity settings)
- Help/support options (help, product documentation, and technical support available)
- Supported browsers (Internet Explorer, Netscape, Firefox, Opera, and Chrome)
- Supported platforms (Vista, XP, 2000, NT, Mac, and Linux)

The main drawbacks of industrial tests are their subjectivity and lack of rigor. For instance:

- Performance evaluation is reported under unknown conditions, from the test set size and composition to the setup of the hosting machines. Moreover, testing conditions may be favorable to a specific vendor.
- Performance measures are never supported with statistical tests.
- The criteria regarding a number of features (usability, scalability, etc.) are not under public review, and are rarely supported by real scenarios.
- Procedures are also private, and may also be unfair.


However, these evaluations can be very helpful for making an initial screening of vendors before taking a purchase decision.

6.2 Scientific Evaluation

Scientific evaluations are those developed in the context of well-defined experiments supported by rigorous procedures and metrics, as is usual in scientific papers. These evaluations are usually performed by personnel with scientific training, in laboratory conditions, and more importantly, the experiments are reproducible and the results comparable. We have reviewed above a number of papers covering several technical approaches to Web filtering. With respect to the evaluations reported in these papers, we conclude that:

- The only quality feature tested in the scientific literature regarding Web filtering is effectiveness or accuracy, in other words, the degree of success the system has when blocking inappropriate contents and allowing appropriate contents. Efficiency is only occasionally considered (most often in the case of image processing), although it plays a critical role in real-world conditions. Other features like scalability, portability, usability, etc., are vastly ignored.
- In the case of effectiveness, there is a lack of common data sets, procedures, and metrics. These must be agreed upon by the scientific community, and the main vehicle to achieve this goal is the organization of rigorous competitive evaluations, like those performed in other domains such as spam filtering. In that domain, the Text REtrieval Conference (TREC) has featured a track devoted to spam filtering [10], which has established common procedures and metrics and disseminated standard data sets, promoting considerable development of current techniques.

6.2.1 Performance Evaluation

In any scientific evaluation of the effectiveness of a classification system like a Web filter, three main issues must be considered:

- Test sets, which are collections of URLs, Web pages, images, client requests, etc., correctly classified by human experts. The systems are fed with the contents of the test set, and their decisions are compared to those of the human experts. Collections should be public and standard.
- Procedures that define, for example, whether the contents (e.g., URLs) are served to the classifier one by one or in batches, etc. The procedures must be defined with the goal of resembling real-world scenarios.
- Performance metrics that fairly allow the comparison of several technical approaches or systems.

Unfortunately, nearly all the collections used in the papers reviewed in this chapter are private, and rarely shared with other researchers. Moreover, they hardly represent real-world situations, as they are composed of sets of items (URLs, HTML files, images, etc.) without user frequency information. As user requests are highly biased toward a relatively small set of popular sites, these collections do not represent real user behavior. Regarding procedures, all the studies reviewed make use of batch testing, consisting of presenting the full set of items to classify to the system, without allowing it to learn from previous mistakes or hits. As in spam filtering, online methods that do allow learning while testing may better resemble operational environments [10]. The evaluation metrics used in the literature are not standard either. The evaluation of effectiveness is aimed at estimating the quality of the classifier in terms of success and failure rates over a set of classified items. The metrics used have been adopted from the fields of information retrieval and machine learning. Table II shows a confusion matrix, representing the possible outcomes of a binary (two-category) classification system when comparing its classification results to the correct ones (the gold standard). Compared to that gold standard, retrieved items can be true positives (TP) if the classifier has identified a positive document as positive, false positives (FP) if the classifier has assigned positive to a negative document, false negatives (FN) when a positive document is categorized as negative, and true negatives (TN) if a negative document has been classified as negative. Related to these values, the most commonly used measures are [50]:

- Precision (P): proportion of items classified as positive that are really positive
- Recall (R): proportion of items classified as positive out of the whole set of positive items
- Accuracy (A): proportion of items that have been correctly classified
- Error (E): proportion of items that have been classified incorrectly

Table II
CONFUSION MATRIX FOR TWO CLASSES

                    Real → C+    Real → C−
Classifier → C+     TP           FP
Classifier → C−     FN           TN


These metrics are defined by the following formulas, where N = TP + FP + FN + TN is the total number of items:

\[ R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}, \qquad A = \frac{TP + TN}{N}, \qquad E = \frac{FP + FN}{N}. \]

One of the challenges when interpreting precision and recall is that there is usually a tradeoff between them: if a system tries to increase precision, recall will decrease, and vice versa. This has led to several ways of combining both factors, the most widely used being the F-measure [50], which can be calculated according to the following expression:

\[ F_\beta = \frac{(1 + \beta^2)\,R\,P}{\beta^2 P + R}. \]

The parameter β represents the relative weight of precision: lower values place more emphasis on precision, whereas higher values place more emphasis on recall. A value of β = 1 is often used, giving the same weight to precision and recall. F1 is computed with the following formula:

\[ F_1 = \frac{2RP}{R + P}. \]
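To make these definitions concrete, the following sketch computes all of the above metrics from the four confusion-matrix counts. It is a minimal Python illustration of the formulas, not tied to any particular filtering system.

```python
# Minimal sketch: the effectiveness metrics defined above, computed
# from the four confusion-matrix counts (TP, FP, FN, TN).
def metrics(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0):
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / n
    error = (fp + fn) / n
    f_beta = (1 + beta**2) * recall * precision / (beta**2 * precision + recall)
    return {"P": precision, "R": recall, "A": accuracy,
            "E": error, f"F{beta:g}": f_beta}

# Example: 90 true positives, 10 false positives, 20 false negatives.
print(metrics(tp=90, fp=10, fn=20, tn=880))
```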

When multiple categories are defined (e.g., pornography, violence, gambling, etc.), these measures must be averaged in some way. This can be done in two ways: by calculating the arithmetic average over all categories (macroaveraging), thus giving the same weight to every category; or by weighting each category by the number of instances it contains (microaveraging). Other alternative measures often employed in the context of Web page filtering are:

- Overblocking: proportion of safe items that are blocked by the classifier
- Underblocking: proportion of unsafe items that are incorrectly allowed by the classifier

The reader can easily discover that overblocking and underblocking can be computed as 1 − P and 1 − R, respectively. Some researchers have made the effort of trying to resemble some of the procedures of more standardized fields like spam filtering (e.g., [22]), by using, for example, the Receiver Operating Characteristic Convex Hull method, which provides a better understanding of the behavior of a classifier under imprecise conditions.

6.2.2 The Kaiser–Resnick Study

Perhaps the most influential and serious study regarding Web-filtering evaluation is the one developed by Resnick and others [45, 46]. Under the deployment of the Children's Internet Protection Act, the Kaiser Family Foundation commissioned a team led by Resnick to test the effectiveness of commercial filters with respect to pornography versus health information. In a simulation of adolescent Internet searching, these researchers compiled the search results from 24 health information searches and six pornography searches. They manually classified the content of each site as pornography (516 sites), health information (2467 sites), or other (1004 sites). Afterwards, they tested six filtering tools commonly used in libraries and schools and one home product, each at two or three levels of blocking restrictiveness. At the least restrictive blocking setting, configured to block only pornography, the products tested blocked a mean of 1.4% of health information sites. However, 10% of the health sites found using search terms related to sexuality (e.g., safe sex, condoms) and homosexuality (e.g., gay) were blocked. The mean pornography blocking rate was 87%. At moderate settings, the mean blocking rate was 5% for health information sites and 90% for pornography. At the most restrictive settings, the mean blocking rate was 24% for health information sites and 91% for pornography sites. The main positive aspects of this experiment are:

- The test collection tries to resemble an operational environment, namely youngsters searching the Internet for health information. Moreover, the test collection is public and available to researchers.
- The researchers made a good and effective effort toward evaluating the tradeoff between overblocking and underblocking, by defining configuration scenarios with different levels of restrictiveness.

Unfortunately, the study is narrow (young people, health vs pornography) and the URLs used in it are outdated. However, this study represents by far the best practice in filtering effectiveness evaluation.

7. Attacks and Countermeasures

In this section, we discuss a number of approaches that have been used to avoid filtering without detection. We do not cover physical attacks or hacking, as these are easily detected.

7.1 Disguising and Wrong Self-Labeling

Many Web filters block sites using a black list of URLs of adult sites. Some of those filters use only the domain name and not the IP address, so an easy way to bypass that kind of filter is to use the IP address instead of the usual URL.


An attacker (i.e., a person who wants to visit a blocked adult Web site) can open a command prompt and ping the blocked domain. The ping output shows the IP address of the site, so he can then type the IP address in the browser instead of the normal URL. Even if the Web filter blocks the IP address of the site, attackers can still obfuscate the URL. For example, they can take each number in the IP address and convert it to hexadecimal format, and then enter in the browser: "http://0x(hex1).0x(hex2).0x(hex3).0x(hex4)". There are many scripts on the Web that will do this conversion. To prevent attackers from bypassing the filters with these techniques, content analysis and/or URL deobfuscators must be implemented. Content publishers can also evade Web content filters by disguising the content, using JavaScript and dynamically generated content. In that case the content filter does not receive HTML text but obfuscated JavaScript code. Many filters cannot parse and interpret JavaScript code, so they cannot classify the page as harmful, and the page is passed on to the client. Another common practice among adult Web publishers is to tag the content of their pages with safe labels instead of the right ones. For this reason, filters based on labels are not very reliable.
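As an illustration of this attack and the corresponding countermeasure, the sketch below converts a dotted-quad IP address into the per-octet hexadecimal form described above, and normalizes such a host back to its canonical form so a filter can match it against a black list. The function names are ours, for illustration only.

```python
# Illustrative sketch: per-octet hexadecimal obfuscation of an IP-based
# URL, and the corresponding filter-side normalization.
def obfuscate_ip_url(ip: str) -> str:
    """E.g. '203.0.113.7' -> 'http://0xcb.0x00.0x71.0x07/'."""
    return "http://" + ".".join(f"0x{int(o):02x}" for o in ip.split(".")) + "/"

def normalize_host(host: str) -> str:
    """Rewrite hex octets back to decimal so black lists can match."""
    parts = host.split(".")
    if all(p.lower().startswith("0x") for p in parts):
        parts = [str(int(p, 16)) for p in parts]
    return ".".join(parts)

print(obfuscate_ip_url("203.0.113.7"))        # http://0xcb.0x00.0x71.0x07/
print(normalize_host("0xcb.0x00.0x71.0x07"))  # 203.0.113.7
```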

7.2 Proxies

A circumventor is a method of defeating blocking policies implemented using proxy servers. Ironically, most circumventors are also proxy servers, of varying degrees of sophistication, which effectively implement "bypass policies." By using an external proxy server, a user can bypass a local filter. There are also several Web services that allow users to browse anonymously, bypassing some network restrictions. By using those services, local clients make connections only to the server where the service is hosted, so any filter that blocks particular URLs can be bypassed, because clients never communicate directly with the target server. There are many public proxy servers that can be used for browsing the net. To use those proxies, the attacker has to change the network properties of the Web browser, specifying the proxy server address and port. For this reason, many system administrators do not allow the connection properties of the browsers to be changed, so users without administrative privileges will not be able to use an external proxy. Unfortunately, there are versions of some browsers that can be carried on a pen drive (portable applications) and used without installing them on the computer, simply by running them directly from the pen drive. Users can change the properties of those browsers and use external proxies without problems.


Even if an attacker cannot use an external proxy, there are still "home-made" techniques to simulate a proxy by using legitimate Web services like search engines or translation Web sites. When Google bots crawl the Web, they store a copy of the visited content on Google's servers; those cached copies can then be consulted using Google's "cache:" operator. Since big search engines like Google are usually whitelisted to avoid filtering them, an attacker can view objectionable content through the cached versions of the Web sites. A similar strategy can be used with online translators that translate any Web page from one language to another, because the attacker is visiting the translation service's Web server instead of the original one. Translators can also be used to confuse content filters, simply by translating the Web site to a language not supported by the filter.

7.3 Anonymization Networks

Most Web content filters work by analyzing the content transmitted to the hosts, so they will not work if the traffic is encrypted or obfuscated. TOR (http://www.torproject.org/) is a software project aiming to protect its users against traffic analysis attacks. TOR operates an overlay network of onion routers that enables anonymous outgoing connections and anonymous "hidden" services. It also encrypts the data transmitted over the net, so content filters cannot analyze it. TOR uses a series of three proxies, i.e., computers (or nodes) which communicate on the user's behalf using their own identifying information, in such a way that none of them knows both the user's identifying information and the destination. Fortunately, TOR requires administrative privileges to be installed and configured properly, so normal users will not be able to use this kind of software on school or workplace networks.

8. Review of Singular Projects

There are many projects and solutions oriented to providing effective Web content-filtering solutions. As we have previously seen, a relevant amount of research has been focused on designing algorithms and techniques able to process the textual or graphic elements of Web content in order to classify it accordingly. In this section, we synthesize aspects of three representative research projects of increasing dimension.

8.1 Wavelet Image Pornography Elimination

The wavelet image pornography elimination (WIPE) system [61], developed by Wang, Li, and Wiederhold, was motivated by the situation we have already depicted, in which families have increasingly broad access to the Internet, and children's access to objectionable graphics is a growing problem that many parents are concerned about. WIPE was designed to classify an image as objectionable or benign. The system compares the semantic content of images, mainly consisting of objects such as the human body. It uses a combination of an icon filter, a graph photo detector, a color histogram filter, a texture filter, and a wavelet-based shape matching algorithm to provide a decision about online objectionable pornographic images. Semantically meaningful feature vector matching is carried out so that comparisons between a given online image and images in a premarked training data set can be performed efficiently and effectively. The combination of techniques allows the system to face problems such as the low quality of images, images containing more than one person or only some parts of a person, and the different skin colors of the persons in one or several images. The system was trained with a database of about 500 objectionable images and about 8000 benign images, and evaluated on a test set of 1076 objectionable images and 10,809 benign images, demonstrating 96% sensitivity with 9% of benign photographs wrongly classified. This project has become the top reference regarding pornographic image processing techniques, and it is probably one of the most influential ones in the short history of Web filtering.
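WIPE's combination of fast, cheap filters with a more expensive shape-matching stage can be read as a classifier cascade. The sketch below shows that general pattern under this reading; the stage functions are placeholders, not the actual WIPE algorithms.

```python
# A generic classifier cascade in the spirit of WIPE: cheap stages may
# confidently label an image as benign, and only images that survive
# every cheap stage reach the expensive final matcher. Stage functions
# here are placeholders, not the actual WIPE components.
def cascade(image, cheap_stages, expensive_stage):
    for stage in cheap_stages:
        if stage(image) == "benign":   # early exit saves computation
            return "benign"
    return expensive_stage(image)      # e.g., wavelet shape matching
```

An early benign decision by the cheap stages (icon or histogram filters, say) spares most traffic the cost of wavelet-based matching, which is plausibly why such combinations are efficient in practice.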

8.2 Public Open-Source Environment for Safer Internet Access

Public Open-Source Environment for Safer Internet Access (POESIA, http://www.poesia-filter.org/) [27] was a multisite project funded under the EU Internet Action Plan. The project included actions to develop, test, evaluate, and promote a fully open-source and extensible filtering software solution. POESIA provides an advanced Internet-filtering system, intended primarily for use in schools and other educational establishments, with the aim of providing safe and educationally appropriate Internet access for young people.


POESIA's approach is to use multiple filters, each of which addresses some source of evidence that is of potential use in identifying harmful pages. The evidence detected can then be combined by a decision mechanism component to produce an overall decision for each page. In this way, the system can best exploit whatever information is available to determine which pages should be filtered. The POESIA filters include some that implement widely used filtering methods based on listed Web sites, but the project promotes automatic content-based analysis of Web pages to achieve broader coverage. The system includes filters addressing both image and textual content.

The multiple filters of the system operate in combination. For example, a page from a site which is not in the URL lists will be analyzed for its content. If the page contains a reasonable quantity of text, this alone might allow a reject decision, but if there is limited text, the combination of image and text evidence might be required for a decision to be made. The decision mechanism plays an important role in weighting the available evidence to produce an overall judgment.

For image-based filtering, POESIA includes the implementation of a detector that identifies pornographic images by exploiting a range of learning and image processing methods. It includes a maximum entropy model for skin detection. The output of skin detection is a grayscale skin map, with the gray level indicating the belief that a pixel is skin. Some simple features are then calculated from the skin map and fitted ellipses, and used to train a multilayer perceptron classifier with back propagation [65]. The detector is able to cope with difficulties such as variations in skin color and in the capturing conditions (illumination, camera, compression, noise, etc.), and proves especially practical compared with existing systems in terms of processing speed.

For textual content, the system includes specific filters for different languages: English, Spanish, and Italian. The filters differ in some of the methods they employ, partly reflecting an attempt to optimize over the different aspects of the languages. However, the filters are alike in offering both "light" and "heavy" filtering modes. Light filtering, which uses little NLP, provides rapid assessment of content for straightforwardly classifiable pages. For other pages, heavy filtering, making greater use of NLP, is invoked to provide more sensitive detection of content indicators. Light filtering includes conventional statistical text classification techniques, using a bag-of-words representation with stop list and stemming, according to the vector space model (VSM), and linear support vector machine (SVM) classifiers (a sketch of this stage follows below). Heavy filtering makes a deeper analysis of the content, including different linguistic features such as noun phrases recognized using POS tagging, named entities, some specific aspects depending on the language, and different additional machine learning techniques [27].

The system was tested and evaluated by an end user team: Telefonica R&D and FCR (Spain), the software firm PIXEL (Italy), and Liverpool Hope University (United Kingdom). Different end user cases and educational contexts were considered.
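As a rough illustration of such a light filtering stage, the following sketch trains a bag-of-words linear SVM with scikit-learn. The tiny corpus and labels are placeholders, and the real POESIA filters used their own term processing (per-language stop lists and stemming).

```python
# Minimal sketch of a "light" text filter: bag-of-words + linear SVM.
# The two-document corpus is a placeholder, not POESIA data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_pages = [
    "school library homework science museum",  # safe page text
    "explicit adult content free pics",         # harmful page text
]
train_labels = [0, 1]  # 0 = safe, 1 = harmful

light_filter = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LinearSVC(),
)
light_filter.fit(train_pages, train_labels)
print(light_filter.predict(["free adult pics"]))  # classify a new page
```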


The evaluation reaffirmed several requirements: the POESIA software should not be limited to filtering one language, should filter a variety of content, and should allow users the flexibility to define the content that must be rejected. Regarding the categories of content to be filtered, pornography was rated high priority, gross language medium, and racism and violence low. The POESIA architecture readily allows for the inclusion of additional or substitute filters, and so the open-source character of the project allows for the continuing development of the system. We consider this project very important because it proposes an agent-like architecture and a two-level filtering operation that are still quite advanced. Also, its open-source nature makes an important difference.

8.3 NetProtect I and II

NetProtect (http://www.net-protect.org/) and NetProtect II are projects partially funded by the European Commission under the Safer Internet Action Plan and related to the development of rating and filtering systems for Internet content. The NetProtect project (2001–2002) aimed at building a prototype of a third-party filtering solution able to filter out pornographic material found on Web pages written in English, French, German, Greek, or Spanish. NetProtect II (2002–2003) was the follow-up to NetProtect I. The overall objective of the NetProtect II project was to improve and industrialize the NetProtect prototype in order to have a commercially available product by the end of the project. Surf-mate was the final software solution commercialized on the basis of the NetProtect components.

NetProtect provides a solution for Internet access filtering dealing with pornography, and also violence, bomb-making, and drugs, found on Web sites written in eight languages: Dutch, English, French, German, Greek, Italian, Portuguese, and Spanish. The NetProtect project also investigated tools able to filter not only Web pages, but also discussions while chatting on the Web or reading newsgroups or email. It follows a scheme similar to the POESIA project described above, integrating white and black lists and assessing the textual and graphic content of each page individually. Surf-mate is the software tool resulting from NetProtect. It combines state-of-the-art techniques for the classification of multimedia documents:

- Black/white lists of URLs and a keyword pattern detection mechanism to analyze URLs (thanks to Optenet, http://www.optenet.com/)
- Machine learning based on on-the-fly text analysis (thanks to the text classifier that was especially developed for the NetProtect II project and the topic classifier that had been developed for the previous NetProtect project)
- Real-time image classification (based on F4i's ICA component)

This project produced perhaps the first serious study of the effectiveness of existing filters, one that later guided the development of an effective commercial tool; this makes it a must-know in the Web-filtering field.

9. Conclusions and Future Trends

In this chapter, we have presented a review of state-of-the-art Web content-filtering tools and techniques. The review is preceded by a motivation section that defines sensible usage scenarios for these tools and discusses censorship and free speech issues. We also cover some attacks on filtering tools. After this review, we reach the following conclusions:

- From the point of view of usage, Web content filters are a support tool. They must be used to enforce suitable Internet usage policies, agreed between decision makers and users, and supported by a wide variety of other measures, including education and information. Filters can only be as good as the policies they enforce; under suitable policies they can be very valuable and contribute to the protection of children on the Internet.
- Technically, Web filters have reached a very good degree of sophistication and effectiveness, and they are routinely deployed in a variety of scenarios. However, they are not perfect and still make mistakes; further improvement is required.
- To foster the required technical improvement, the research community has to agree on evaluation procedures and metrics. We believe that the best way forward is to follow the good practice of the spam filtering domain, that is, setting up a similar competitive evaluation framework.

As a final note, we must remember the ever-changing nature of the Web and its users. This especially concerns content creators and sexual predators. On one side, everyone can easily publish a Web page (e.g., a blog), and this freedom must be reconciled with the need to protect children. On the other side, the emergence of new interaction tools (e.g., social networks like MySpace (http://www.myspace.com/) or Facebook (http://www.facebook.com/), online games like World of Warcraft (http://www.worldofwarcraft.com/), virtual worlds like Second Life (http://secondlife.com/) and Lively (http://www.lively.com/), or content streaming sites like YouTube) must be supervised closely; kids and adolescents are easily attracted to these tools, where they can get exposed to sexual predators. Next-generation Web filters must be able to deal with these evolving hazards.

References

[1] Agarwal N., Liu H., and Zhang J., June 2006. Blocking objectionable Web content by leveraging multiple information sources. SIGKDD Explorations Newsletter, 8(1): 17–26.
[2] American Civil Liberties Union, 2008. United States Court of Appeals for the Third Circuit, No. 07-2539, American Civil Liberties Union and others vs. Michael B. Mukasey. Available at: http://www.ca3.uscourts.gov/opinarch/072539p.pdf.
[3] Arentz W. A., and Olstad B., 2004. Classifying offensive sites based on image content. Computer Vision and Image Understanding, 94(1–3): 295–310.
[4] Bikel D., Schwartz R., and Weischedel R., 1999. An algorithm that learns what's in a name. Machine Learning, 34(1–3): 211–231.
[5] Block J., 2008. Editorial: Issues for DSM-V: Internet addiction. American Journal of Psychiatry, 165: 306–307.
[6] Breslau L., Cao P., Fan L., Phillips G., and Shenker S., 1999. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of Infocom'99, IEEE Press, New York, NY.
[7] Chou C., Sinha A., and Zhao H., 2008. A text mining approach to Internet abuse detection. Information Systems and E-Business Management, 6: 419–439.
[8] Commission on Child Online Protection, 2000. Report to Congress. Available at: http://www.copacommission.org/report/COPAreport.pdf.
[9] Consortium for School Networking, 2001. Safeguarding the Wired Schoolhouse: A Briefing Paper on School District Options for Providing Access to Appropriate Internet Content. White paper available at: http://www.safewiredschools.org/pubs_and_tools/white_paper.pdf.
[10] Cormack G., 2007. TREC 2007 spam track overview. In Proceedings of the 16th Text Retrieval Conference (TREC 2007), NIST Special Publication SP 500-274.
[11] Council of Europe, 2001. Convention on Cybercrime. ETS No. 185, available at: http://conventions.coe.int/Treaty/EN/Treaties/Html/185.htm.
[12] Council of Europe, 2003. Additional Protocol to the Convention on Cybercrime, concerning the criminalisation of acts of a racist and xenophobic nature committed through computer systems. ETS No. 189, available at: http://conventions.coe.int/Treaty/en/Treaties/Html/189.htm.
[13] Dalvi N., Domingos P., Mausam, Sanghai S., and Verma D., 2004. Adversarial classification. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, pp. 99–108. ACM Press, Seattle, WA.
[14] Daubechies I., 1992. Ten Lectures on Wavelets. SIAM, Philadelphia, PA.

[15] Deibert R., Palfrey J., Rohozinski R., and Zittrain J. (Eds.), 2008. Access Denied: The Practice and Policy of Global Internet Filtering. MIT Press, Cambridge, MA.
[16] Dumais S. T., Platt J., Heckerman D., and Sahami M., 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (G. Gardarin, J. C. French, N. Pissinou, K. Makki, and L. Bouganim, Eds.), pp. 148–155. ACM Press, New York, NY.
[17] European Commission Against Racism and Intolerance, 2006. Expert Seminar: Combating Racism While Respecting Freedom of Expression. Proceedings, Strasbourg, 16–17 November 2006, available at: http://www.coe.int/t/dghl/monitoring/ecri/activities/22-Freedom_of_expression_Seminar_2006/NSBR2006_proceedings_en.pdf.
[18] Firstbrook P., 2007. Pros and Cons of SaaS Secure Web Gateway Solutions. Gartner Research, ID No. G00145299.
[19] Fleck M., Forsyth D., and Bregler C., 1996. Finding naked people. In Proceedings of the 4th European Conference on Computer Vision, Cambridge, UK, pp. 593–602.
[20] Forsyth D. A., and Fleck M. M., 1996. Identifying nude pictures. In Applications of Computer Vision, 1996. Proceedings of the 3rd IEEE Workshop on WACV, pp. 103–108.
[21] Fox A., 2007. Caught in the Web. HR Magazine, 52(12): 35–39.
[22] Gómez Hidalgo J. M., Puertas Sanz E., Carrero García F., and de Buenaga Rodríguez M., 2003. Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en Internet. Procesamiento del Lenguaje Natural, 31: 13–20 (in Spanish).
[23] Greevy E., and Smeaton A. F., 2004. Text categorisation of racist texts using a support vector machine. In Proceedings of JADT.
[24] Guermazi R., Hammami M., and Hamadou A., 2007. Combining classifiers for Web violent content detection and filtering. In Computational Science (ICCS 2007), pp. 773–780.
[25] Hammami M., Chahir Y., and Chen L., February 2006. WebGuard: A Web filtering engine combining textual, structural, and visual content-based analysis. IEEE Transactions on Knowledge and Data Engineering, 18(2): 272–284.
[26] Hearst M., 1999. Untangling text data mining. In Proceedings of ACL'99: The 37th Annual Meeting of the Association for Computational Linguistics.
[27] Hepple M., Ireson N., Allegrini P., Marchi S., Montemagni S., and Gómez Hidalgo J. M., 2004. NLP-enhanced content filtering within the POESIA project. In Proceedings of LREC.
[28] Internet Security Systems, 2006. Proventia content analysis technology: An ISS whitepaper. Document No. PM-PROCONA-0506.
[29] Joachims T., 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (C. Nédellec and C. Rouveirol, Eds.), pp. 137–142. Springer-Verlag, Berlin.
[30] Jones M. J., and Rehg J. M., 1998. Statistical color models with applications to skin detection. Technical Report CRL 98/11.
[31] Kam K., 2007. 4 Dangers of the Internet. WebMD, available at: http://www.webmd.com/parenting/features/4-dangers-internet.
[32] Kim Y., Nam T., and Won D., 2006. 2-way text classification for harmful Web documents. In Computational Science and Its Applications (ICCSA 2006), pp. 545–551.
[33] Kim C., Kwon O., Kim W., and Choi S., 2008. Automatic system for filtering obscene video. In ICACT 2008, 10th International Conference on Advanced Communication Technology, vol. 2, pp. 1435–1438.
[34] Kozima H., 1993. Text segmentation based on similarity between words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 22–26 June 1993, Columbus, OH.

[35] Lee P., Hui S., and Fong A., 2002. Neural networks for Web content filtering. IEEE Intelligent Systems, 17(5): 48–57.
[36] Lefebvre G., Zheng H., and Laurent C., 2006. Objectionable image detection by ASSOM competition. In Proceedings of the International Conference on Image and Video Retrieval, Tempe, AZ, July 2006 (H. Sundaram et al., Eds.), Lecture Notes in Computer Science, vol. 4071, pp. 201–210. Springer-Verlag, Berlin.
[37] Lenhart A., and Madden M., 2007. Teens, Privacy & Online Social Networks. Pew Internet & American Life Project report, available at: http://www.pewinternet.org/pdfs/PIP_Teens_Privacy_SNS_Report_Final.pdf.
[38] Lewis D. D., 1992. Representation and learning in information retrieval. Ph.D. Thesis, Department of Computer Science, University of Massachusetts, Amherst, MA.
[39] Li Y. H., and Jain A. K., 1998. Classification of text documents. The Computer Journal, 41(8): 537–546.
[40] Lim V., Thompson S., and Loo G., 2002. How do I loaf here? Let me count the ways. Communications of the ACM, 45(1): 66–70.
[41] Palo Alto Networks, 2008. The Application Usage and Risk Report. An Analysis of End User Application Trends in the Enterprise. White paper available at: http://www.paloaltonetworks.com/literature/whitepapers/Application_Usage_Risk_Report_April08.pdf.
[42] Polpinij J., Chotthanom A., Sibunruang C., Chamchong R., and Puangpronpitag S., 2006. Content-based text classifiers for pornographic Web filtering. In IEEE International Conference on Systems, Man and Cybernetics (SMC'06), vol. 2, pp. 1481–1485.
[43] Princeton Survey Research Associates International, 2004. Parents & Teens 2004 Survey. A survey conducted for the Pew Internet & American Life Project, available at: http://www.pewinternet.org/pdfs/PIP_Teens_04_Questionnaire.pdf.
[44] Resnick P., and Miller J., 1996. PICS: Internet access controls without censorship. Communications of the ACM, 39(10): 87–93.
[45] Resnick P., Hansen D., and Richardson C., September 2004. Calculating error rates for filtering software. Communications of the ACM, 47(9): 67–71.
[46] Richardson C., Resnick P., Hansen D., and Rideout V., 2002. Does pornography-blocking software block access to health information on the Internet? Journal of the American Medical Association, 288(22): 2887–2894.
[47] Rideout V., 2007. Parents, Children & Media. A Kaiser Family Foundation Survey, available at: http://kaiserfamilyfoundation.org/entmedia/upload/7638.pdf.
[48] Rowley H. A., Jing Y., and Baluja S., 2006. Large scale image-based adult-content filtering. In International Conference on Computer Vision Theory and Applications.
[49] Rubenking N., 2008. 12 Tools to Keep Kids Safe Online. PC Magazine Security Product Guide, available at: http://www.pcmag.com/article2/0,2817,2272565,00.asp.
[50] Salton G., and McGill M. J., 1983. Introduction to Modern Information Retrieval. McGraw-Hill Computer Series, New York, NY.
[51] Schapire R., and Singer Y., 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2–3): 135–168.
[52] Schettini R., Brambilla C., Cusano C., and Ciocca G., 2003. On the detection of pornographic digital images. In Visual Communications and Image Processing 2003, Proceedings of the SPIE (T. Ebrahimi and T. Sikora, Eds.), vol. 5150, pp. 2105–2113.
[53] Sebastiani F., 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1): 1–47.

[54] Sebastiani F., 2006. Classification of text, automatic. In The Encyclopaedia of Language and Linguistics (K. Brown, Ed.), 2nd edn., vol. 14, pp. 457–462. Elsevier Science Publishers, Amsterdam, The Netherlands.
[55] Siau K., Fui-Hoon F., and Teng L., 2002. Acceptable Internet use policy. Communications of the ACM, 45(1): 75–79.
[56] Simmers C., 2002. Aligning Internet usage with business priorities. Communications of the ACM, 45(1): 71–74.
[57] Spink A., Jansen B. J., Wolfram D., and Saracevic T., 2002. From e-sex to e-commerce: Web search changes. IEEE Computer, 35(3): 133–135.
[58] Stanton J., 2002. Company profile of the frequent Internet user. Communications of the ACM, 45(1): 55–59.
[59] Su G., Li J., Ma Y., and Li S., 2004. Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model. Journal of Zhejiang University Science, 5(9): 1106–1113.
[60] Victory N., 2003. Children's Internet Protection Act. Pub. L. 106-554, Study of Technology Protection Measures in Section 1703, Report to Congress, available at: http://www.ntia.doc.gov/ntiahome/ntiageneral/cipa2003/cipareport_08142003.htm.
[61] Wang J., Li J., Wiederhold G., and Firschein O., 1998. System for screening objectionable images. Computer Communications, 21(15): 1355–1360. Elsevier Science Publishers, Amsterdam, The Netherlands.
[62] Websense, Inc., 2006. Web@Work Survey 2006. Conducted by Harris Interactive (available at http://www.websense.com/).
[63] Wynn A., and Trudeau P., 2001. Internet abuse at work: Corporate networks are paying the price. SurfControl White Paper.
[64] Yang Y., and Pedersen J. O., 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning (D. H. Fisher, Ed.), pp. 412–420. Morgan Kaufmann Publishers, San Francisco, CA.
[65] Zheng H., Liu H., and Daoudi M., 2004. Blocking objectionable images: Adult images and harmful symbols. In 2004 IEEE International Conference on Multimedia and Expo, vol. 2, pp. 1223–1226.

ARTICLES

Article 3: Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en internet.

Gómez Hidalgo, J. M., Puertas Sanz, E., Carrero García, F., Buenaga Rodríguez, M. (2003). Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en internet. Procesamiento del Lenguaje Natural (Vol. 31, pp. 13-20).

Impact

This article was published in the journal Procesamiento del Lenguaje Natural, the reference journal in Spain for Language Engineering topics, which in 2012 was indexed in SJR in the second quartile.

Summary

The growing problem of access to inappropriate Internet content can be approached as a cost-sensitive automated text categorization problem. In this article we present the comparative evaluation of a representative range of learning algorithms and cost-sensitization methods on two collections of Web pages in Spanish and English. The results of our experiments are promising.

Contributions

In this article we also address the Web filtering problem, but we focus above all on presenting the problem of error costs and their asymmetry. We explain how it influences the process and detail how to properly carry out an evaluation in this type of situation.

Cost-Sensitive Text Categorization for the Filtering of Inappropriate Internet Content (Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en Internet)*

José María Gómez Hidalgo, Enrique Puertas Sanz, Francisco Carrero García, Manuel de Buenaga Rodríguez
Departamento de Inteligencia Artificial, Universidad Europea de Madrid
Villaviciosa de Odón, 28670, Madrid (Spain)
{jmgomez,epuertas,fcarrero,buenaga}@uem.es

Abstract: The access to inappropriate Internet content is an increasing problem that can be approached as a cost-sensitive Automated Text Categorization task. In this paper, we report a series of experiments that compare a representative range of learning algorithms and methods for making them cost-sensitive, on two Web page collections in Spanish and English. The results of our experiments are promising.

Keywords: Automated Text Categorization, Internet Filtering, Cost-Sensitive Learning, Receiver Operating Characteristic.

(*) This research has been partially funded by the European Commission through the Safer Internet Action Plan (POESIA – SIAP-2117) and by the Ministry of Science and Technology through the PROFIT program (FIT-070000-2002-861).

1. Introduction

There is no doubt that the Internet and its progressive adoption in every sphere of our society (the so-called Information Society) brings notable benefits to its users, enabling new forms of communication, work, and education. However, the inherently distributed and hard-to-control nature of the Internet also entails important risks for its exploitation. In particular, one of the most notable problems nowadays is the access to inappropriate content by children and young people, and by professionals in their working environment. On the one hand, there is the possibility that, both actively and passively, children and young people access Internet content that they are unable to judge correctly, including pornographic content, or content that promotes violence, racism, and adhesion to sects.

On the other hand, workers access similar or other kinds of content (job search information, entertainment content, and others) in their working environment, making use of company resources for purposes they are not intended for, and thus incurring abuse (ACM, 2002). In both cases, and given the lack of appropriate regulations, the adoption of filtering and monitoring systems to limit the access to inappropriate Internet content is advisable. Our institution takes part in two R&D projects, POESIA and TEFILA, oriented to the development of tools for the filtering of inappropriate Internet content in both environments. Specifically, the goal of POESIA (Public Open-source Environment for a Safer Internet Access) is the development of an open-source system for the filtering of inappropriate information in school environments, by means of the integration of advanced techniques from Natural Language Engineering, Machine Learning, image processing, and other fields. In turn, TEFILA (TEcnicas de Filtrado basadas en Ingeniería del Lenguage, Aprendizaje automático y agentes) pursues the development of filtering techniques for the workplace that are more effective, flexible, and configurable than the current ones.

In our work, we approach the detection of inappropriate content as an Automated Text Categorization (ATC) problem (Sebastiani, 2002). This task consists of the automatic classification of documents into predefined categories. In our case, and for the research presented in this paper, the documents are Web pages, and the categories are Pornography and Safe, in reference to the type of information most commonly filtered by current tools. Automated categorization systems can be built manually (deriving sets of classification rules, in the style of expert systems) or partially automatically (using Information Retrieval and Machine Learning techniques to build classification systems from manually classified examples). This second approach, called learning-based in the literature (Sebastiani, 2002), usually involves the representation of the documents (Web pages) as vectors of term weights, to which learning algorithms are applied that extract classification models called classifiers. Nowadays, the classifiers obtained by these means are almost as accurate as human beings, especially in thematic categorization (using category systems oriented to the topic of a document, such as bibliographic systems or those of Web directories like Yahoo!). However, usual learning-based ATC systems do not take into account that the different types of errors the system can make have different costs for the user. In our case, it is more harmful for a child to access pornographic content (an underblocking error) than to be unable to access valid content (an overblocking error). To solve this problem, it is necessary to apply learning methods that are sensitive to the cost of errors, which are able to take these asymmetric costs into account in order to derive classifiers that prefer some types of errors over others.

In this paper we present a series of experiments oriented to evaluating the effectiveness of a representative range of learning algorithms (including naive Bayes classifiers, decision tree induction with C4.5, classification rule generation, and Support Vector Machines) adapted to the cost by means of different strategies (threshold-based, weight-based, and MetaCost), for the detection of pornographic content on the WWW, in English and Spanish. The different approaches have been evaluated using the Receiver Operating Characteristic Convex Hull method, which is the most suitable one when the costs of errors are asymmetric but unknown under real conditions.

The rest of this paper is organized as follows. First, we present the cost-sensitive learning-based categorization model. Second, we describe the evaluation method used in our experiments. Next, we describe the experimental setting, detailing the approaches evaluated and the evaluation collection. The results of our experiments are then presented and analyzed, and finally we draw conclusions and outline future lines of work.

2. Cost-Sensitive Automated Categorization

Learning-based ATC is nowadays a solid and effective categorization model. Usually, this model does not take into account that the classification of inappropriate content is a problem with asymmetric costs, that is, one in which some types of errors are more harmful than others. In this section we present the general model of learning-based ATC and its adaptation to environments in which error costs are asymmetric.

2.1. Learning-Based Automated Categorization

The construction of learning-based ATC systems relies on several elements taken from the fields of Information Retrieval and Machine Learning (Sebastiani, 2002). Essentially, the process consists of representing a set of manually classified documents (in our case, Web pages, i.e., documents in HTML format), called the training collection, as vectors of term weights, to which a learning algorithm is applied that builds a classification model or classifier. The documents to be classified are represented in the same way, so that the classifier is able to assign a category (in our case, Pornography or Safe) to them. This model is based on several elements:

1. The document representation or indexing method. The most usual one consists of representing documents as vectors of term weights, following the Vector Space Model for Information Retrieval (Salton, 1989). In this model, called "bag of words" in the literature, terms are normally isolated words to which a stemmer such as Porter's is applied, once the most frequent words or those with the least semantic content have been removed using a stop list. The weights of the terms in each document can be defined in several ways, including binary weights (1 if the term occurs in the document, 0 otherwise), TF weights (Term Frequency, the frequency of the term in the document), or TF.IDF weights (the former multiplied by the Inverse Document Frequency, usually defined as $\log_2(n/df(t))$, where $n$ is the number of training documents and $df(t)$ is the number of documents in which term $t$ occurs); see the sketch after this list.

2. The term selection method. In order to avoid overfitting during learning, and to increase its efficiency and effectiveness, a subset of the original terms is usually selected. To this end, a term quality metric is used, and the terms with the highest values for it are selected. In (Yang and Pedersen, 1997) it is experimentally shown that using the Information Gain (IG) and $\chi^2$ metrics allows the removal of up to 99% of the original terms, achieving an important increase in efficiency and even a slight increase in effectiveness.

3. The learning algorithm. Many learning algorithms have been applied to ATC problems in the literature, including probabilistic classifiers such as naive Bayes, decision tree induction with C4.5, classification rule generation with Ripper, Support Vector Machines (SVM), and others (see the survey in (Sebastiani, 2002) and the comparison in (Yang, 1999)). The effectiveness of the algorithms is variable, SVM being among the most effective ones.

This model is very effective in situations where categorization is thematic, that is, when one or several topics are to be assigned to a document according to its content. In (Sebastiani, 2002) it is argued that, at present, learning-based ATC is able to reach degrees of accuracy similar to those of human beings.
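The following sketch illustrates the weighting schemes of item 1 on a toy corpus; it is a plain Python rendering of the definitions above, using the base-2 logarithm as in the paper.

```python
# Illustrative sketch of the term weights defined in item 1: binary,
# TF, and TF.IDF with IDF = log2(n / df(t)). Toy corpus, plain Python.
import math

docs = [["porn", "free", "pics"], ["school", "free", "books"]]
n = len(docs)

def df(term):                       # number of documents containing term
    return sum(term in d for d in docs)

def weights(doc, term):
    tf = doc.count(term)
    binary = 1 if tf > 0 else 0
    tfidf = tf * math.log2(n / df(term))
    return binary, tf, tfidf

print(weights(docs[0], "porn"))   # (1, 1, 1.0): occurs in 1 of 2 docs
print(weights(docs[0], "free"))   # (1, 1, 0.0): occurs in every doc
```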

2.2. Cost-Sensitive Learning for Text Categorization

Most of the above learning algorithms, by their very nature, seek to minimize the number of errors of the generated classifier. However, there are many Machine Learning (and ATC) problems in which the errors made by the generated classifier do not all have the same importance (Provost and Fawcett, 2001). In particular, in ATC oriented to the filtering of unsolicited bulk email, or spam, it is preferable for the system to classify an unsolicited message as legitimate rather than the opposite (Gómez, 2002). This is because the user of the system is likely to delete the unsolicited messages without an excessively detailed examination, running the risk of deleting legitimate and important messages. In terms of costs, and assuming that the positive class (the one to be detected) is unsolicited bulk email, a False Positive error (classifying a legitimate message as unsolicited bulk email) is said to have a higher cost than a False Negative error (the opposite). This situation also arises in the classification of pornographic Web pages. In school environments, and due to the possible consequences for children and young people, it is preferable to err on the side of excess (classifying safe pages as pornographic) than on the side of defect (the opposite). Therefore, it is necessary to apply learning algorithms that are sensitive to the cost of errors, and that, when building the classifier, prioritize avoiding some types of errors over others.

Cost-sensitive learning methods are usually adaptations of existing algorithms, such as decision trees (Ting, 1998) and others. However, there are strategies that are independent of the learning algorithm used. These strategies, commonly called learning meta-schemes, take as input a learning algorithm, a training collection, and a cost distribution, and generate a classifier based on the learning algorithm and adapted to the error costs. Examples of these strategies are:

- Thresholding (Witten and Frank, 1999), which can be applied to any algorithm whose output is a classifier that emits numeric values (such as probabilities, similarities, etc.). The idea is simple: if, for example, a classifier $\Phi(d)$ assigns the positive class to a document $d$ from a threshold $\nu$ onwards (that is, when $\Phi(d) > \nu$), the threshold is adjusted to make the classifier more or less conservative, using a subcollection of training documents reserved for this purpose (see the sketch after this list).

- Instance Weighting (Ting, 1998), applicable to any learning algorithm. It is based on giving more weight to the documents or instances of one class (e.g., Pornography), so that the algorithm concentrates especially on classifying these instances correctly, minimizing the error on them. The assigned weight is proportional to the cost of the errors on those documents.

- MetaCost (Domingos, 1999), applicable to any learning algorithm. This sophisticated technique consists of relabeling the training collection according to the output of a committee of classifiers generated by the base algorithm using the bagging method, and then training a classifier on the relabeled collection.

The application of these methods makes it possible to adapt the algorithms traditionally employed in ATC to the cost, generating classifiers that are more effective in situations where the distribution of error costs is asymmetric.
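A minimal sketch of the thresholding meta-scheme, assuming a classifier that outputs positive-class scores: the threshold ν is tuned on a held-out set to minimize the total cost under a given cost distribution. The function name and example values are illustrative.

```python
# Minimal sketch of thresholding: tune the decision threshold of a
# score-emitting classifier on a held-out set so that the total
# misclassification cost is minimized.
import numpy as np

def tune_threshold(scores, labels, cost_fp, cost_fn):
    """scores: positive-class scores on held-out documents;
    labels: 1 for the positive class, 0 otherwise."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_cost = 0.5, float("inf")
    for t in np.unique(scores):
        pred = scores > t
        cost = (cost_fp * np.sum(pred & (labels == 0)) +
                cost_fn * np.sum(~pred & (labels == 1)))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

# Example: missing a positive document costs 10 times a false alarm,
# so the tuned threshold drifts low (the classifier turns aggressive).
print(tune_threshold([0.1, 0.4, 0.6, 0.9], [0, 1, 0, 1],
                     cost_fp=1, cost_fn=10))
```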

3. Evaluating Cost-Sensitive Categorization

The evaluation of ATC systems is usually based on two main elements: the availability of an evaluation collection, and the use of effectiveness metrics. The metrics account for the rates of hits and errors of a classifier over the evaluation collection.

3.1. Evaluation Collections

An evaluation collection is a set of manually classified documents on which an ATC system is evaluated. The archetypal example of an evaluation collection for ATC is the Reuters-21578 collection, which contains news items in English classified according to thematic categories of an economic nature (economic indicators, currencies, commodities, etc.) (Sebastiani, 2002). Evaluation collections are usually divided into two parts: one fragment of the collection is reserved for training, and another fragment is used for evaluation. Up to four partitions of Reuters have been produced and used in different works. An alternative to this approach, frequent in Machine Learning, is k-fold cross validation. Given a natural number k (frequently 10), the collection is divided into k parts, so that each part keeps the same distribution of documents into classes. k runs are performed, each with k − 1 parts for training and 1 for evaluation, and the results are averaged. In the absence of a solid partition of the evaluation collection, this is the most reasonable approach.
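As an illustration of this protocol, the sketch below runs stratified 10-fold cross validation with scikit-learn (stratification keeps the class distribution in every fold, as described above); the classifier and the random data are placeholders.

```python
# Illustrative sketch of stratified k-fold cross validation (k = 10):
# each fold preserves the class distribution; results are averaged.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import BernoulliNB
import numpy as np

X = np.random.randint(0, 2, size=(100, 20))  # placeholder binary vectors
y = np.array([0] * 80 + [1] * 20)            # imbalanced classes

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(BernoulliNB(), X, y, cv=cv)
print(scores.mean())                          # averaged effectiveness
```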

3.2. Effectiveness Metrics

The most popular effectiveness metrics in the evaluation of ATC systems come from the field of Information Retrieval, and include recall, precision, and the F1 measure, which combines both (Sebastiani, 2002). These metrics are not suitable in situations where costs are asymmetric. Measures such as Weighted Accuracy or the cost (assuming 0 for hits, and the corresponding cost for each error) are more suitable in situations where the costs associated with the errors are known.

However, real costs are rarely known in the real world, and moreover they can change from one environment to another. For example, a school may give priority to blocking the access to pornographic pages at the risk of blocking safe ones, whereas a company devoted to medicine may prefer the opposite. In situations of unknown real costs, such as those arising in the filtering of pornography on the Internet and in the detection of unsolicited bulk email, it is more suitable to carry out the evaluation using the Receiver Operating Characteristic Convex Hull (ROCCH) method (Provost and Fawcett, 2001), which allows the comparison of approaches when costs are unknown but relevant, and the selection of the most suitable method once they are fixed. We describe this method next; for reasons of space the description must be brief, and we refer the reader to (Provost and Fawcett, 2001) for more depth.

3.3. The ROCCH Method

The ROCCH method has already been used in ATC for the evaluation of unsolicited bulk email detection systems (Gómez, 2002). This method starts by building Receiver Operating Characteristic (ROC) graphs for the evaluated classifiers. A ROC graph is similar to a recall-precision graph, in which the False Positive Rate (FPR, defined as the percentage of instances classified into the positive class while belonging to the negative class) is represented on the x axis, and the True Positive Rate (TPR, defined as the percentage of instances classified into the positive class among those belonging to it) on the y axis. For each cost-sensitive classifier and each cost distribution, a point (FPR, TPR) can be obtained and represented on a ROC graph. The closer the point is to the top left corner, the better the classifier. A curve can be obtained by joining the points obtained for different cost distributions, assuming linear interpolation. The comparison between two graphs is performed in the same way as with recall-precision graphs. Due to theoretical results, the points of a ROC diagram that do not lie on the upper convex hull of the set of represented points can be discarded. Likewise, given a distribution of classes and of costs for each type of error (corresponding to a concrete real situation), it is possible to select the most effective classifier or classifiers under those conditions. This is because each set of class and cost distribution conditions corresponds to the slope of a line that can be swept from bottom to top to find the uppermost point that maximizes effectiveness, or in other words, minimizes cost.

Operationally, the ROCCH method consists of the following steps:

1. For each cost-sensitive classification method, obtain a ROC curve as follows: (a) train and evaluate the classifier for a set of representative cost distributions, obtaining a series of (FPR, TPR) points (each point is obtained by evaluating on an evaluation collection; for the results to be more reliable, and in the absence of a stable partition of the evaluation collection, it is advisable to obtain each point by cross validation); and (b) obtain the upper convex hull of the series of points, discard those that are not part of it, and join the remaining ones by straight lines (linear interpolation).

2. Obtain the upper convex hull of all the represented curves. Typically, some algorithms will outperform others over a range of abscissas, since it is difficult for one approach to be absolutely superior to the rest.

3. Find the range of slopes for which each curve coincides with the convex hull. In this way, a table is obtained that defines under which conditions each classifier is best.

4. If the class and cost distributions of the real operating environment are known, obtain the slope associated with those distributions and select the best available classifier according to this analysis.

The ROCCH method allows the visual comparison of the effectiveness of a set of classifiers, independently of the class and cost distributions (Provost and Fawcett, 2001). In this way, the decision of which method is best can be delayed until the real operating conditions are known, while obtaining valuable information in the meantime.
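A minimal sketch of the hull construction at the core of the method: given the (FPR, TPR) operating points of a set of classifiers, keep only those on the upper convex hull, since every discarded point is suboptimal for every class and cost distribution. Plain Python; the points are illustrative.

```python
# Minimal sketch of the ROCCH construction: keep only the (FPR, TPR)
# points on the upper convex hull; dominated or collinear points are
# dropped. The points in the example are illustrative.
def upper_convex_hull(points):
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Pop hull[-1] if it lies on or below the segment hull[-2] -> p
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Four operating points; (0.2, 0.5) is dominated and disappears.
print(upper_convex_hull([(0.1, 0.6), (0.2, 0.5), (0.4, 0.9), (0.3, 0.8)]))
```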

4. Experimental Design

En esta secci´on describimos en detalle los experimentos realizados, centr´andonos en la construcci´on de la colecci´on de entrenamiento y evaluaci´on, y en los enfoques evaluados.

En nuestra experiencia, estos datos pueden ser suficientes para realizar un entrenamiento bastante efectivo en la detecci´on de contenidos pornogr´aficos en ingl´es y espa˜ nol.

4.1.

4.2.

La colecci´ on de evaluaci´ on

We have taken as a basis the Open Directory Project (ODP, http://dmoz.org). A Web directory is a set of resources organized into hierarchically ordered categories. The most popular Web directory today is the one included in the Yahoo! portal; the ODP, however, is the largest directory currently in existence. The ODP lists more than 3.8 million resources (Web pages, newsgroups, FTP servers, etc.) organized into more than 460,000 categories, maintained free of charge by almost 56,200 human editors. Unlike Yahoo! and others, this directory is free in several senses, most notably in that its entire contents can be downloaded without restrictions. The ODP can be used to build collections of pornographic and safe Web pages in multiple languages, since its contents are separated by region and organized into adult and general-audience sections. Specifically, the ODP contains approximately 2.5 million safe references in English and 100,000 adult ones; in the sections about Spain, it contains approximately 100,000 safe references and 1,000 adult ones. Using a processor and a software robot programmed for this purpose, we built a collection of references as follows. We collected all the safe and adult addresses in English and Spanish, and randomly selected a subset of them, obtaining 5,335 safe and 1,021 adult addresses in Spanish, and 5,091 safe and 1,002 adult addresses in English. We then downloaded all the pages accessible at those addresses, excluding those that took more than 10 seconds to respond, obtaining 4,956 safe and 966 adult documents in Spanish, and 2,570 safe and 129 adult documents in English. In our experience, these data may be sufficient for reasonably effective training in the detection of pornographic content in English and Spanish.

4.2. Collection processing

Documents are represented using the Vector Space Model (Salton, 1989), as vectors of binary term weights. Terms are defined from the words, i.e. sequences of alphanumeric characters separated by blanks or other separators, once the HTML tags have been stripped from the pages. Words are filtered using a language-specific stoplist and stemmed using Porter's algorithm. This yields approximately 16,600 distinct terms for Spanish and 11,900 for English. Next, for each language we selected 1% of the original terms, using Information Gain as the attribute quality metric. Thus, 166 terms are used to represent the Spanish documents and 119 for English. The Safe and Porn classes are taken as the positive and negative classes, respectively.
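A minimal sketch of this representation step follows: strip HTML, tokenize into alphanumeric words, drop stopwords, stem, and build binary term vectors. The stoplist and stemmer shown are stand-ins (the paper does not name concrete libraries); NLTK's Porter stemmer is used here for illustration, and the tiny stoplist is a placeholder.

import re
from nltk.stem import PorterStemmer  # SnowballStemmer("spanish") would serve for Spanish

STOPLIST = {"the", "of", "and", "a", "to"}  # placeholder; use a full per-language list
stemmer = PorterStemmer()

def terms(html):
    text = re.sub(r"<[^>]+>", " ", html)             # remove HTML tags
    words = re.findall(r"[0-9a-z]+", text.lower())   # alphanumeric word tokens
    return {stemmer.stem(w) for w in words if w not in STOPLIST}

def binary_vector(html, vocabulary):
    """vocabulary: list of selected terms (e.g. the top 1% by Information Gain)."""
    doc_terms = terms(html)
    return [1 if t in doc_terms else 0 for t in vocabulary]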

4.3. Evaluated approaches

In this work we have evaluated the following learning algorithms:

Naive Bayes (Lewis, 1998), which builds a probabilistic model of the categories, assuming that term occurrences are independent of each other.

C4.5 (Quinlan, 1993), which induces a decision tree whose branches are labelled with tests on the values (weights) of the attributes (terms), and whose leaves are labelled with the majority class of the training examples that satisfy the tests defining a path to the leaf.

PART (Frank and Witten, 1998), which produces decision rule lists. Each rule consists of a conjunction of tests as antecedent and a class as consequent. Rules are applied in sequence, the last one being a default rule that assigns the majority class of the training examples not covered by the other rules.

SVM (Joachims, 2001), which in our configuration generates a linear function over the term weights; applying it to a new document yields a numeric value. If the value is greater than zero, the document is classified into the positive class, and otherwise into the negative one.

Each algorithm was made cost-sensitive using the three meta-schemes described in Section 2.2 (except the combination of MetaCost with the four learning algorithms, for Spanish). For each algorithm, meta-scheme and language, 41 (FPR, TPR) points were obtained by 10-fold cross-validation. These points correspond to the following cost ratios between the positive and negative class: 1/1000, 1/900, ..., 1/100, 1/90, ..., 1/10, 1/5, 1, 5, 10, 20, ..., 90, 100, ..., 900 and 1000.
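The sweep can be sketched as follows. The paper's learners come from other toolkits; scikit-learn's LinearSVC with class weights is used here only as a stand-in for the weighting-based meta-scheme, and X, y are assumed to be the binary term vectors and labels built earlier (1 = safe/positive, 0 = porn/negative).

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def roc_points(X, y, cost_ratios):
    points = []
    for r in cost_ratios:
        # class_weight={0: r} penalizes errors on the negative class r times
        # more, i.e. makes a false positive r times as costly as a false negative.
        clf = LinearSVC(class_weight={0: r, 1: 1.0})
        pred = cross_val_predict(clf, X, y, cv=10)
        tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return points

cost_ratios = ([1 / x for x in range(1000, 99, -100)]   # 1/1000 ... 1/100
               + [1 / x for x in range(90, 9, -10)]     # 1/90 ... 1/10
               + [1 / 5, 1, 5]
               + list(range(10, 100, 10))               # 10 ... 90
               + list(range(100, 1001, 100)))           # 100 ... 1000 (41 ratios)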

5. Results and analysis

Figure 1 presents the convex hull of the curves obtained for each classifier and cost-sensitivity meta-scheme, as two curves (Spanish - SP and English - EN). For clarity, we omit the 20 curves corresponding to each combination of algorithms and languages. Table 1 presents the optimality points for each classifier lying on the upper convex hull, for the Spanish and English collections. The first and second columns give the (FPR, TPR) values of each optimal classifier. Classifiers are identified by their initial letter (C - C4.5, P - PART, S - SVM), meta-schemes by the middle letter (W - weights, M - MetaCost), and distributions by the final characters ([i] + cost, where i denotes the inverse of the cost). Thus, "SW020" denotes the classifier obtained using cost-sensitive SVM with the weighting method, for a cost distribution in which false positives are 20 times more important than false negatives.

[Figure 1: Upper convex hull for the evaluated approaches, in Spanish (SP) and English (EN); TP rate plotted against FP rate.]

Table 1: Optimal points and classifiers for the Spanish and English experiments.

Spanish
FPR    TPR    Classifier
0.000  0.588  SW100
0.001  0.678  SW020
0.002  0.726  SW005
0.010  0.791  SW001
0.025  0.846  SWi005
0.035  0.854  SWi010
0.072  0.866  SWi020
0.180  0.886  SWi030
0.702  0.974  PWi080
0.718  0.976  PWi090
0.815  0.987  SWi300
0.842  0.990  CWi200
0.901  0.996  PWi700
1.000  1.000  CWi400

English
FPR    TPR    Classifier
0.000  0.442  SW005
0.002  0.558  SM001
0.005  0.589  SM005
0.006  0.597  SM010
0.033  0.682  PM010
0.044  0.690  PM030
0.153  0.760  PWi060
0.618  0.930  PWi300
0.737  0.953  CWi900
0.981  1.000  PM300

In light of these data, it is worth noting that:

None of the algorithms and cost-sensitivity meta-schemes is clearly dominant for either language, as is usual in real-world settings.

The Naive Bayes algorithm and the threshold-based scheme turn out to be suboptimal under all operating conditions.

For the Spanish experiments, the most frequently winning algorithm is SVM, in combination with the weighting meta-scheme. These results are consistent with those obtained for the detection of unsolicited commercial email in (Gómez, 2002), although it must be noted that the combination with MetaCost has not been evaluated on this collection.

For the English experiments, PART is instead the most frequently optimal algorithm, and MetaCost the most effective meta-scheme.

Under extreme or near-extreme conditions, that is, when no false positive is admitted (a pornographic Web page classified as safe), the most effective classifier is SVM with weights (Spanish) or with MetaCost (English). For example, SVM with weights achieves a TPR of 0.588 for Spanish, meaning that 58.8% of safe pages are classified correctly while no pornographic page is misclassified.

The percentages obtained under extreme conditions are insufficient for a real system, although promising. A post-mortem analysis of the results reveals deficiencies in the test collection: significant percentages of pages have little or no text, correspond to 404 download errors, or are frame-based pages.

6. Conclusions and future work

Although the experiments performed are promising, further work along this line is needed. We plan to: (1) enrich and refine the training and test collection, extracting more pages from the ODP and internal pages of the referenced Web sites, removing those corresponding to errors, and extracting the content of HTML frames; and (2) extend the experiments to other learning algorithms (in particular, the k nearest neighbours algorithm and the AdaBoost meta-learning scheme applied to C4.5) and other text representations (TF.IDF weights). Although in practice it is impossible to reach 100% accuracy, we expect to achieve results very close to it.

References

ACM. 2002. Internet abuse in the workplace. Communications of the ACM, 45(1).
Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
Frank, E. and I.H. Witten. 1998. Generating accurate rule sets without global optimization. In Machine Learning: Proceedings of the Fifteenth International Conference, pages 144-151. Morgan Kaufmann.
Gómez, J.M. 2002. Evaluating cost-sensitive unsolicited bulk email categorization. In Proceedings of the ACM Symposium on Applied Computing.
Joachims, T. 2001. A statistical learning model of text classification with support vector machines. In Proceedings of the 24th ACM International Conference on Research and Development in Information Retrieval. ACM Press.
Lewis, D.D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the 10th European Conference on Machine Learning, pages 4-15. Springer Verlag.
Provost, F. and T. Fawcett. 2001. Robust classification for imprecise environments. Machine Learning Journal, 42(3):203-231.
Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Salton, G. 1989. Automatic text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47.
Ting, K.M. 1998. Inducing cost-sensitive trees via instance weighting. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, pages 139-147.
Witten, I.H. and E. Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2):69-90.
Yang, Y. and J.O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning.

ARTICLES

Article 4: Content Based SMS Spam Filtering

Gómez Hidalgo, J. M., Bringas, G. C., Puertas Sanz, E., García, F. C. (2006). Content Based SMS Spam Filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering, ACM Press (pp. 107-114).

Impact

This article was published at the ACM Symposium on Document Engineering, a conference of the largest computing association from a scientific and educational standpoint. The article has received 71 citations as of the submission of this work, according to Google Scholar.

Abstract

In recent years we have witnessed a dramatic increase in the volume of email spam. Other related forms of junk messaging are emerging as problems of growing importance, especially spam on Instant Messaging services (so-called SPIM) and on the Short Message Service (SMS), i.e. mobile spam. Like email spam, the SMS spam problem can be addressed with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. This work analyzes the extent to which the Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping spam on mobile devices. In particular, we built two SMS spam test collections of significant size, in English and Spanish, and evaluated on them a number of Feature Engineering techniques and Machine Learning algorithms in terms of effectiveness.

Contributions

This article is the most cited of the four presented in this Thesis. The reason is that it introduces for the first time the Feature Engineering approach and the evaluation techniques presented in this Thesis. We were the first to use them, and since then they have become a reference for addressing this kind of problem. Before this article, the spam problem had not been approached by applying feature engineering followed by an evaluation that is independent of the error costs.

Content based SMS spam filtering

José María Gómez Hidalgo
Universidad Europea de Madrid
Villaviciosa de Odón, 28670 Madrid, SPAIN
34 91 211 5670
[email protected]

Guillermo Cajigas Bringas
Group R&D - Vodafone ES
Avenida de Europa, 1, 28108 Alcobendas, Madrid, SPAIN
34 610 51 34 93
[email protected]

Enrique Puertas Sánz
Francisco Carrero García
Universidad Europea de Madrid
Villaviciosa de Odón, 28670 Madrid, SPAIN
34 91 211 5611
[email protected]
[email protected]

ABSTRACT In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as an important problem, especially spam on Instant Messaging services (the so-called SPIM), and Short Message Service (SMS) or mobile spam. Like email spam, the SMS spam problem can be approached with legal, economic or technical measures. Among the wide range of technical measures, Bayesian filters are playing a key role in stopping email spam. In this paper, we analyze to what extent Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. In particular, we have built two SMS spam test collections of significant size, in English and Spanish. We have tested on them a number of message representation techniques and Machine Learning algorithms, in terms of effectiveness. Our results demonstrate that Bayesian filtering techniques can be effectively transferred from email to SMS spam.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – information filtering. H.3.4 [Information Storage and Retrieval]: Systems and Software – performance evaluation (efficiency and effectiveness).

General Terms Experimentation, Security.

Keywords spam, junk, Bayesian filter, Receiver Operating Characteristic

1. INTRODUCTION

Mobile spam has been a relevant problem in Far East countries since 2001. In Korea, the volume of mobile spam was already bigger than the volume of email spam at the end of 2003. Spam messages are used to advertise dating services, premium rate numbers, or to sell drugs and software. Several countries have taken legal and technical measures to control the SMS spam problem. The Japanese government passed two acts in 2002, which defined and penalized email and mobile abuse. These laws, the self-regulation efforts of Mobile Network Operators, and some technical limitations have helped to reduce (but not eliminate) the volume of mobile spam. All in all, experts consider that mobile spam can only be controlled through the combination of technical and legal measures.

SMS spam has been regarded as a minor problem in Western countries, mostly because the cost of sending spam messages is much bigger than that of sending email spam. But in Europe, SMS messaging is in fashion: nearly all people over 15 years old own a mobile phone, and an average user sends about 10 SMS a day. That makes SMS messages a perfect target for abuse. Moreover, botnets of zombie PCs are being used to emulate real users when sending SMS messages through free SMS messaging services in e.g. Russia. So the cost of SMS spam is decreasing; in other words, mobile spam can pay. In fact, more than 80% of users in CE admit to having received mobile spam.

A variety of technical measures against spam email have already been proposed (see below). Most of them can effectively be transferred to the problem of mobile spam. One of the most popular ones is the so-called Bayesian filters: programs that, through learning, are able to discriminate legitimate from spam messages. In this paper, we study the possibility of applying Bayesian filtering techniques to the problem of SMS spam filtering. First, we review a number of technical measures against spam email, focusing on Bayesian filtering. Then we present how a Mobile Network Operator sees the problem of mobile spam in Europe, and which role Bayesian filtering can play in reducing or stopping it. After that, we describe a series of experiments designed to test to what extent it is effective in addressing the mobile spam problem.

2. TECHNICAL MEASURES AGAINST SPAM EMAIL

Millions of spam email messages are sent every day, advertising pornographic Web sites, drugs or software products, or attempting fraud (phishing) [2]. Spam email has an important economic impact on end users and service providers. The increasing importance of this problem has motivated the development of a set of techniques to fight it, or at least, to provide some relief.

2.1 Spam Email Filtering

The most popular techniques used to reduce spam nowadays include the following.

White and black listing. The senders occurring in a black list (e.g. RBL) are considered spammers, and their messages are blocked. The messages from senders in a white list (e.g. the address book, or the provider itself - Hotmail) are considered legitimate, and thus delivered.

Postage. For delivering a message, postage is required: economic (e.g. one cent per message), computational (e.g. the computation of a costly function) or Turing-test based (e.g. the verification that the sender is a person instead of a program). See e.g. Spamproof.

Address management. It consists of the usage of temporary, machine-generated addresses, which are managed automatically by the system and discarded when they begin receiving spam (see HP Channels).

Collaborative filtering. When a user tags a message as spam, it is considered spam for users similar to him/her. Alternatively, the service provider considers that massive messages (e.g. more than N=20 recipients) are spam.

Digital signatures. Messages without a digital signature are considered spam. Digital signatures can be provided by the sender or the service provider (e.g. DomainKeys).

Content-based filtering [14]. The most used method. Each message is searched for spam features, like indicative words (e.g. "free", "viagra", etc.), unusual distribution of punctuation marks and capital letters (as e.g. in "BUY!!!!!!"), etc.

Most of the techniques above can be directly applied to the problem of mobile spam, but among them, content-based filtering (and in particular, Bayesian filtering) is playing a key role in reducing spam email. In fact, its success has forced spammers to periodically change their practices and to disguise their messages in order to bypass these kinds of filters. In this work, we focus on Bayesian filtering of SMS spam in Europe.

2.2 Bayesian Filtering Techniques

Content-based spam filters can be built manually, by hand-engineering the set of attributes that define spam messages. These are often called heuristic filters [4], and some popular filters like SpamAssassin have been based on this idea for years. Content-based filters can also be built by using Machine Learning techniques applied to a set of pre-classified messages [16]. These so-called Bayesian filters are very accurate according to recent statistics [1], and their applicability to SMS spam seems immediate.

Bayesian filters [3] automatically induce or learn a spam classifier from a set of manually classified examples of spam and legitimate (or ham) messages (the training collection). The learning process takes as input the training collection, and consists of the following steps [14]:

• Preprocessing. Deletion of irrelevant elements (e.g. HTML), and selection of the segments suitable for processing (e.g. headers, body, etc.).

• Tokenization. Dividing the message into semantically coherent segments (e.g. words, other character strings, etc.).

• Representation. Conversion of a message into an attribute-value pairs vector [13], where the attributes are the previously defined tokens, and their values can be binary, (relative) frequencies, etc.

• Selection. Statistical deletion of less predictive attributes (using e.g. quality metrics like Information Gain).

• Learning. Automatically building a classification model (the classifier) from the collection of messages, as they have been previously represented. The shape of the classifier depends on the learning algorithm used, ranging from decision trees (C4.5) or classification rules (Ripper) to statistical linear models (Support Vector Machines, Winnow), neural networks, genetic algorithms, etc.

Each new target message is pre-processed, tokenized, represented and fed into the classifier, in order to take a classification decision on it (whether it is spam or not). Current methods in Bayesian filter development are focused on the first steps, given that the quality of the representation has a big impact on the accuracy of the learned model. A toy sketch of such a pipeline is given below.
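The sketch below wires the tokenize/represent/select/learn steps together using scikit-learn components as stand-ins (the paper does not prescribe a toolkit); the two-class training set is invented for illustration, and chi-squared scoring stands in for Information Gain.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("represent", CountVectorizer(binary=True)),  # tokenization + binary attribute vector
    ("select", SelectKBest(chi2, k=5)),           # keep the most predictive tokens
    ("learn", MultinomialNB()),                   # the "Bayesian" classifier
])

messages = ["WIN a FREE prize now!!!", "Cheap pills, buy now",
            "see you at lunch tomorrow", "meeting moved to 3pm"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate (ham)
pipeline.fit(messages, labels)
print(pipeline.predict(["free prize waiting"]))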

2.3 Application to Mobile Spam

Since having a good term representation is one of the most important parts of getting a good classifier, we have to face the fact that SMS messages do not have the same structure and characteristics as email messages. We have described techniques used to filter spam email messages, but we cannot state that they are also effective at filtering SMS. SMS messages are usually shorter than email messages: only 160 characters are allowed in a standard SMS text, and that could be a problem, because fewer words mean less information to work with. Also, due to the above constraint, people tend to use acronyms when writing SMS. Moreover, the abbreviations used by SMS users are not standard for a language, but depend on the user communities. Such language variability provides more terms or features, and a sparser representation. We have to test whether the state-of-the-art methods used to extract terms from email messages are also suitable for SMS texts.

3. SMS SPAM IN EUROPE

The threat of mobile spam is clear: everybody feels that the mobile handset has become a very personal piece of technology and wants to keep it useful, personal and free of invasions like viruses and spam. From the European Mobile Network Operator (MNO) point of view, we can classify mobile spam according to how it is produced. The final user can roughly receive mobile spam from three main sources:

• MNOs or parties that pay MNOs for delivering the SMS to the final user.

• Parties that manage not to pay for the SMS that are finally delivered to the user.

• User-originated messages that bother the receiver.

The first case seems to be mainly responsible for the high number of users admitting they have received spam in Europe. MNOs, third parties and authorities have adopted and enforced the use of opt-out, or even opt-in (in the case of third parties), processes for the user to stop receiving promos or ads. MNOs often disconnect parties that do not comply with MNO policies on legitimate SMS. The second case is usually worse, as it is fraud, and it not only damages the MNO brand but also its revenue stream. MNOs have already installed tools and processes to detect and cut off this kind of source.

Finally, although the third case is statistically irrelevant, it can produce user complaints, and it cannot be easily managed due to SMS content privacy regulations and business commitments acquired with the user in SMS service delivery. In fact, processes on the MNO side and regulations on the authorities' side seem to be effective enough to have lowered the number of user complaints and to keep them stable in Europe. To be fair, this may also be attributable to the rise in tolerance that users seem to experience. Anyhow, it is also true that these kinds of tools and processes are reactive, and they do leak some of this mobile spam to final users. For instance, from the moment the second kind of abuse begins to the moment it is terminated, the end users still receive SMS spam, and they perceive a decrease in the quality of service. It is also remarkable that the definition of spam is user dependent: what I consider spam could be information for you. In this context, personal, or easily personalized, learning filtering tools could help to reduce the final users' complaints even further, thus helping MNOs to deliver a better service to them.

4. EXPERIMENTS AND RESULTS

We have conducted a series of experiments, with different attribute definitions, several learning algorithms and a suitable evaluation method, in order to test whether Bayesian filtering techniques can be easily transferred to SMS spam filtering. There is some evidence of this [17], but more systematic experiments are required.

4.1 Test Collections

We have built two different collections: one with SMS messages in Spanish and another one with messages in English.

4.1.1 Spanish test database

For this round of experiments we were provided by Vodafone with a sample of Spanish spam messages, obtained from Vodafone users. The legitimate messages have been taken from a joke-by-SMS competition. We have built and used a message database consisting of 199 (14.67%) spam messages and 1,157 (85.32%) legitimate messages. A trivial rejecter (the classifier that always selects the majority class, legitimate or "ham" in our case) would show an accuracy of 0.85.

4.1.2 English test database

We also built an English SMS database by using freely available resources on the Internet. After a detailed search, we found the following resources:

• A collection of 82 SMS spam messages extracted manually from Grumbletext. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning 100 web pages.

• A list of 202 legitimate messages, probably collected by Jon Stevenson, according to the HTML code of the Web page. Only the text of the messages is available. We will call this corpus the Jon Stevenson Corpus (JSC).

• A collection of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore, called the NUS SMS Corpus (NSC). The messages largely originate from Singaporeans, and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.

By using the messages collected by Jon Stevenson and a random sample of the messages in the NUS SMS Corpus, we have built an English SMS database consisting of 1,119 legitimate messages and 82 spam messages. We believe this collection resembles a realistic scenario, because both the legitimate and the spam messages are real messages; the proportion may not be accurate, but we are not aware of the existence of real-world statistics of spam received by cell phone users in the British/Singaporean markets.

4.2 Message Processing and Encoding

According to our previous experience and the literature on spam email filtering, the tokenization step is probably the most important one in the process of analysing messages and learning a classifier on them. A bad representation of the problem data may lead to a classifier of poor quality and accuracy. So we have carefully scanned the literature on spam email detection, in order to build a set of attributes or parameters that guarantees that learning will be successful.

We have decided to use the following set of attributes to represent SMS messages (either spam or legitimate); a sketch of the corresponding attribute extraction follows this list:

• Words - sequences of alphanumeric characters in the message text. We consider that any non-alphanumeric character is a separator. Words are the building blocks of any message.

• Lowercased words - lowercased words in the text message, according to the definition of word above. This way, we map different strings to the same attributes, obtaining new token frequencies that should affect the learning stage.

• Character bi-grams and tri-grams - sequences of 2 or 3 characters included in any lowercased word. These attributes try to capture morphological variance and regularities in a language-independent way. For instance, if the simple past suffix "ed" in English is representative of spam messages, we would be able to capture this knowledge.

• Word bi-grams - sequences of 2 words in a window of 5 words preceding the current word. On one side, it has been demonstrated that the most relevant dependences between words in a language are present only in a window of size 5; that is, a word rarely influences another that is more than 5 tokens away from it. On the other side, this kind of word bi-grams has proven very useful in spam email detection.
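A minimal sketch of the four attribute families above. The function and the feature-name prefixes are illustrative choices, not taken from the paper.

import re

def extract_attributes(sms_text):
    attrs = set()
    words = [w for w in re.split(r"[^0-9A-Za-z]+", sms_text) if w]
    for w in words:
        attrs.add("w:" + w)                     # words
        lw = w.lower()
        attrs.add("lw:" + lw)                   # lowercased words
        for n in (2, 3):                        # character bi-/tri-grams
            for i in range(len(lw) - n + 1):
                attrs.add(f"c{n}:" + lw[i:i + n])
    for i, w in enumerate(words):               # word bi-grams, 5-word window
        for j in range(max(0, i - 5), i):
            attrs.add("bg:" + words[j].lower() + "_" + w.lower())
    return attrs

print(sorted(extract_attributes("WIN a FREE prize now")))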

4.3 Feature Selection

In a data-mining-inspired approach, we have decided to feed the test with a possibly very big number of attributes, leaving to the attribute selection step the responsibility of deleting the less informative ones. We use Information Gain (IG) [18] as the attribute quality metric. The experience in learning-based text classification is that IG can reduce the number of attributes substantially, with no loss (or even some improvement) of accuracy. We have made experiments selecting all tokens scoring over 0 (zero) in Information Gain, and sets of the 100 and 200 tokens with the highest IG. Such tokens may provide information for the spam class (that is, they correlate with it), or for the legitimate class.
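A sketch of Information Gain scoring for binary attribute/label pairs, written from the standard definition IG = H(class) - H(class | attribute); it is not code from the paper.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(has_attr, labels):
    """has_attr: list of booleans (attribute present in message i);
    labels: list of classes ('spam'/'ham')."""
    n = len(labels)
    with_a = [l for a, l in zip(has_attr, labels) if a]
    without = [l for a, l in zip(has_attr, labels) if not a]
    cond = sum(len(part) / n * entropy(part) for part in (with_a, without) if part)
    return entropy(labels) - cond

labels = ["spam", "spam", "ham", "ham"]
print(information_gain([True, True, False, False], labels))  # 1.0: perfectly predictive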

4.4 Machine Learning Algorithms

For our experiments we have used the following algorithms:

• Naive Bayes (NB) [10]. This is the most immediate and simple Bayesian learning method. It is based on the Bayes theorem and an independence hypothesis, generating a statistical classifier based on probabilities. Its name has been borrowed by the class of Machine Learning based email spam filters. It is the simplest technique, and it should be tried before more complex methods.

• C4.5 [12]. A classical learning method that outputs decision trees, with average results on text classification problems. Its main advantage is that decision trees are a simple, clear and expressive formalism.

• PART [7]. This algorithm produces classification rules, based on generating decision rule lists from partial decision trees. It shares the advantages and limitations of C4.5.

• Support Vector Machines (SVM) [9]. An optimisation algorithm that produces (linear) vectors that try to maximally separate the target classes (spam versus legitimate). It is a complex and relatively recent algorithm (1997), which has shown excellent results in text classification applications. However, its output is difficult to interpret.

These algorithms represent a good sample of the learning algorithms currently available, and their learning strategies are different, as are their learning biases [17]. For this round of experiments, we have also used a stratification or re-weighting mechanism, trying to make the algorithms more sensitive to the (under-represented) spam class. The goal is to reduce the number of false negatives (spam classified as ham), in order to block more spam messages [8]. The mechanism consists of incrementing the weight of (or giving more importance to) spam messages in the training collection [15], which is equivalent to stratification: incrementing the number of spam messages by inserting more copies of the available ones. For instance, if spam messages are given a weight of 50 (against a weight of 1 for a legitimate message), we force the algorithm to consider each spam message as 50 messages. Since the algorithms we use try to minimize the error (or to maximize the accuracy), a mistake on a spam message (a false negative) becomes 50 times more important than a mistake on a legitimate message (a false positive) [11]. Anyway, we use cost-weighting as a mechanism focused on finding the accuracy tolerance of the studied algorithms on the provided test collections. In our tests, we use weights of 10 and 100 for spam messages, against a stable weight of 1 for legitimate messages.
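The equivalence between instance weighting and stratification can be sketched as follows; this is a toolkit-agnostic illustration (in libraries such as scikit-learn the usual hook is a sample_weight argument), and the two messages are invented.

def stratify(messages, labels, spam_weight):
    """Return a training set where each spam example appears spam_weight times,
    which is equivalent to giving spam examples a weight of spam_weight."""
    out_msgs, out_labels = [], []
    for m, l in zip(messages, labels):
        copies = spam_weight if l == "spam" else 1
        out_msgs.extend([m] * copies)
        out_labels.extend([l] * copies)
    return out_msgs, out_labels

msgs, labs = stratify(["WIN cash now", "lunch at 2?"], ["spam", "ham"], spam_weight=10)
print(len(msgs))  # 11: the spam message counted 10 times, the ham message once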

4.5 Evaluation Setup

As we need to select the most suitable tokenization and learning methods for an operational environment, we have to perform an evaluation oriented to a number of variable conditions, in terms of class distribution (and balance) and cost of false positives (and negatives), that cannot be known in advance. The most suitable evaluation method for such an imprecise environment is the Receiver Operating Characteristic Convex Hull (ROCCH) method [11]. We have employed this method in previous experiments in spam email filtering and in Web content filtering, two applications that share this imprecise nature; also, this method has become a standard in spam filtering, according to the most well-known spam email filtering competition nowadays, the TREC Spam Track.

4.5.1 The ROCCH Method

Receiver Operating Characteristics (ROC) analysis is a method for evaluating and comparing classifier performance [8]. It has been extensively used in signal detection, and it was introduced and extended in [11] for the Machine Learning community. In ROC analysis, instead of a single value of accuracy, a pair of values is recorded for the different class and cost conditions under which a classifier is learned. The values recorded are the False Positive rate (FP) and the True Positive rate (TP), defined in terms of the confusion matrix as:

FP = fp / (fp + tn)        TP = tp / (tp + fn)

The TP rate is equivalent to the recall of the positive class, while the FP rate is equivalent to 1 minus the recall of the negative class. Each (FP, TP) pair is plotted as a point in the ROC space. Most ML algorithms produce different classifiers under different class and cost conditions; for these algorithms, the conditions are varied to obtain a ROC curve. One point on a ROC diagram dominates another if it is above and to the left, i.e. it has a higher TP and a lower FP. Dominance implies superior performance for a variety of common performance measures, including Expected Cost (and then WA and WE), recall and others. Given a set of ROC curves for several ML algorithms, the one which is closer to the upper left corner of the ROC space represents the best algorithm. ROC analysis allows a visual comparison of the performance of a set of ML algorithms, regardless of the class and cost conditions. This way, the decision of which is the best classifier or ML algorithm can be delayed until the target (real-world) conditions are known.

Instead of drawing a ROC curve through threshold variation, we can vary the class and cost conditions and obtain for each of them a (FP, TP) point using the Threshold method. This view of ROC curve plotting allows the use of other methods for making ML algorithms cost-sensitive. For instance, one can use techniques such as Stratification or MetaCost [5] applied to a ML algorithm to induce a set of classifiers for a range of class and cost conditions, and then link the obtained (FP, TP) points to form a ROC curve.

This is the basis of the method we have applied to obtain ROC curves for a range of methods for making ML algorithms cost-sensitive.

4.6 Results and Discussion

The output of a ROCCH evaluation is a table or plot showing the ROC points or curves for a number of algorithms, and a table showing the slope ranges in which the classifiers lying on the Convex Hull are optimal.

4.6.1 Results and analysis, English database

First, we show in Figure 1 the result of the experiments for all the tested algorithms and each number of attributes (100, 200 and all with IG over zero) for the English language database. The plot shows the ROC curves for each algorithm (NB, C45, PART, SVM) and their Convex Hull, for the corresponding number of attributes, representing the optimal classifiers for all attribute set sizes tested.

[Figure 1. The ROC curves and Convex Hull (100-CH, 200-CH, IG0-CH, ALL-CH), English database; TP plotted against FP.]

We must remind the reader that in a ROC plot, the optimal point is (0,1), which represents no False Positives (no legitimate messages classified as spam) and a maximum of True Positives (all spam messages classified as spam). The closer a point is to this point (or to the upper left corner of the plot), the better it performs. For all the attribute set sizes, the dominant classifier is SVM. This fits experiments in the literature on text classification and spam email detection, where this algorithm has usually performed much better than the others tested in this work. This can also be observed in Table 1, showing the optimal classifiers for the slope ranges presented. The different classifiers (and their (FP, TP) points) correspond to 10-fold cross-validated tests, encoded the following way:

• first, the code of the algorithm (NB, C45, PART, SVM);

• then, the cost ratio, where e.g. 020 means that a false positive (classifying a legitimate message as spam) is 20 times more important than a false negative, and e.g. i030 means that a false positive is 30 times less important than a false negative;

• finally, the number of attributes used for learning, where IG0 means those attributes with an IG score over zero.

Table 1. Slope ranges for various settings, English database.

Slope ranges, English, 100 attributes
Slope Range       ROC point       Classifier
[0.000, 0.038]    (1.000, 1.000)  AllPos
[0.038, 2.000]    (0.028, 0.963)  NB-i090
[2.000, 8.333]    (0.004, 0.915)  SVM-i010
[8.333, 36.000]   (0.001, 0.890)  SVM-001
[36.000, Inf]     (0.000, 0.854)  SVM-005

Slope ranges, English, 200 attributes
[0.000, 0.025]    (1.000, 1.000)  AllPos
[0.025, 2.259]    (0.028, 0.976)  NB-005
[2.259, 13.000]   (0.001, 0.915)  SVM-i005
[13.000, Inf]     (0.000, 0.902)  SVM-001

Slope ranges, English, attributes IG>0
[0.000, 0.025]    (1.000, 1.000)  AllPos
[0.025, 6.256]    (0.043, 0.976)  NB-1000
[6.256, Inf]      (0.000, 0.707)  SVM-001

Slope ranges, English, summary
[0.000, 0.025]    (1.000, 1.000)  AllPos
[0.025, 2.259]    (0.028, 0.976)  NB-005-200
[2.259, 13.000]   (0.001, 0.915)  SVM-i005-200
[13.000, Inf]     (0.000, 0.902)  SVM-001-200

For a given class and cost distribution, we can compute the corresponding slope and find the optimal classifier. For instance, given the current class distribution in the database, P(+) = 0.07 and P(-) = 0.93, and a cost distribution of 1/1 (that is, balanced error costs, C(+,-) = C(-,+)), the distribution slope S is:

S = (P(-) × C(+,-)) / (P(+) × C(-,+)) = 0.93 / 0.07 = 13.28
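To make the selection rule concrete, a small sketch: compute the iso-performance slope from class priors and error costs, then look up the optimal classifier in the slope-range table. The table below is an abbreviated rendering of Table 1 (English, summary); the helper names are illustrative.

def iso_slope(p_pos, cost_fp, cost_fn):
    return ((1 - p_pos) * cost_fp) / (p_pos * cost_fn)

slope_table = [  # (slope_low, slope_high, classifier)
    (0.000, 0.025, "AllPos"),
    (0.025, 2.259, "NB-005-200"),
    (2.259, 13.000, "SVM-i005-200"),
    (13.000, float("inf"), "SVM-001-200"),
]

s = iso_slope(p_pos=0.07, cost_fp=1.0, cost_fn=1.0)  # 0.93 / 0.07 = 13.28...
best = next(c for lo, hi, c in slope_table if lo <= s <= hi)
print(round(s, 2), best)  # 13.29 SVM-001-200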

Given this slope value, the optimal classifiers are:

• SVM-001 (Support Vector Machines with a cost ratio of one), for the cases with 100 and 200 attributes, or those attributes IG-scoring over zero.

• SVM-001-200 (the previous one but with 200 attributes) in general, because it achieves a better TP rate for the same or lower FP rate.

It is remarkable that for 100 attributes, SVM makes one FP while capturing 85.4% of spam messages, but in the other cases there are no FPs. The optimal configuration, represented by the SVM-001-200 classifier, allows detecting over 90% of spam messages in a quite safe-for-the-user environment. These results demonstrate that SVMs are optimal under the Neyman-Pearson criterion for the most reasonable operating scenario. This criterion consists of setting a maximum acceptable FP rate FPMAX, which corresponds to a vertical line in a ROC graph, and selecting the best classifier with an FP rate under FPMAX, that is, the one with the optimal TP rate. In a safe-for-the-user environment, we set the FP rate to zero (no false positives allowed), finding again the same optimal classifier, SVM-001-200.

4.6.1.1 Results and analysis, Spanish database

Next, we show in Figure 2 the result of the experiments for all the tested algorithms and each number of attributes (100, 200 and all with IG over zero) for the Spanish language database.

[Figure 2. The ROC curves and Convex Hull (100-CH, 200-CH, IG0-CH, ALL-CH), Spanish database; TP plotted against FP.]

The optimal classifiers for the different attribute set sizes and slope ranges are given in Table 2, which follows the notation of the previous section. The most remarkable fact is that, again, Support Vector Machines outperform the other algorithms most of the time. Given that the distribution of messages is more balanced in this collection, allowing more predictive attributes to be found, accuracy can score over 95% if we set the FP rate to zero, or even over 99% if we allow a few FPs.

Let us examine the slope corresponding to the actual distribution of classes and costs in the collection. Given that P(+) = 0.146 and P(-) = 0.853, and for a cost ratio of one, the slope value is S = 5.81. The most accurate (and appropriate) classifier for these conditions is SVM-i050-IG0 (Support Vector Machines trained with all attributes with IG over zero, with a cost ratio of 1/50 - a false negative is 50 times more important than a false positive). It may seem counter-intuitive to over-weight false negatives, these being less harmful than false positives (legitimate messages classified as spam), but we must note that the ROCCH method involves testing the algorithms over a full range of (cost) conditions; indeed, one of the strengths of the ROCCH method is that it is able to detect that a classifier trained for a given cost may be optimal for other cost distributions, given that the class distribution also affects classifier learning and performance. Also, even not being on the Convex Hull (or optimal performance plot), SVM-020-IG0 reaches a TP rate of 0.955 (that is, 95.5% of spam messages detected) with an FP rate of zero, very close to the optimal one.

Table 2. Slope ranges for various settings, Spanish database.

Slope ranges, Spanish, 100 attributes
Slope Range       ROC point       Classifier
[0.000, 0.006]    (1.000, 1.000)  AllPos
[0.006, 0.114]    (0.101, 0.995)  SVM-i200
[0.114, 1.429]    (0.057, 0.990)  SVM-i060
[1.429, 1.765]    (0.050, 0.980)  SVM-i040
[1.765, 2.364]    (0.016, 0.920)  SVM-i005
[2.364, 28.333]   (0.005, 0.894)  SVM-001
[28.333, 45.000]  (0.002, 0.809)  SVM-010
[45.000, 206.00]  (0.001, 0.764)  SVM-030
[206.00, Inf]     (0.000, 0.558)  SVM-300

Slope ranges, Spanish, 200 attributes
[0.000, 0.011]    (1.000, 1.000)  AllPos
[0.011, 0.011]    (0.532, 0.995)  PART-i900
[0.011, 0.448]    (0.089, 0.990)  SVM-i500
[0.448, 1.667]    (0.022, 0.960)  SVM-i030
[1.667, 5.000]    (0.010, 0.940)  SVM-i005
[5.000, 17.200]   (0.006, 0.920)  SVM-001
[17.200, 256.00]  (0.001, 0.834)  SVM-020
[256.00, Inf]     (0.000, 0.578)  SVM-400

Slope ranges, Spanish, attributes IG>0
[0.000, 0.040]    (1.000, 1.000)  AllPos
[0.040, Inf]      (0.000, 0.960)  SVM-i050

Slope ranges, Spanish, summary
[0.000, 0.006]    (1.000, 1.000)  AllPos
[0.006, 0.114]    (0.101, 0.995)  SVM-i200-100
[0.114, 0.526]    (0.057, 0.990)  SVM-i060-100
[0.526, Inf]      (0.000, 0.960)  SVM-i050-IG0

Finally, SVM-i050-IG0 is the optimal classifier under the Neyman-Pearson criterion when setting the maximum allowed FP rate to zero.

5. CONCLUSIONS

From this series of experiments, we can derive the following conclusions:

• Given the short size of the messages, and the literature on spam email filtering, it is reasonable to define a wide range of attribute types and let the IG-based attribute selection process choose the most promising ones for classification. However, the number of selected attributes cannot be known in advance, although it seems proportional to the number of spam messages. It may be valuable to test other kinds of features (e.g. encoding all numbers, or marking telephone numbers).



The most suitable learning algorithm for a prototype is, after in-depth evaluation, Support Vector Machines. This is supported by our and others’ previous work in spam email detection and in text classification. Also, although

we have not demonstrated this empirically, the running time of learning with Support Vector Machines has been comparable to Naïve Bayes, and much smaller than the running time for learning rules or decision trees.

6. ACKNOWLEDGMENTS

Our thanks to VODAFONE for funding this research and providing us with the Spanish test data.

7. REFERENCES

[1] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., Spyropoulos, C.D. An Evaluation of Naive Bayesian Anti-spam Filtering. Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), pp. 9-17, 2000.
[2] Drake, C.E., Oliver, J.J., Koontz, E.J. Anatomy of a Phishing Email. Proceedings of the First Conference on Email and Anti-Spam (CEAS), 2004.
[3] Graham, P. Better Bayesian Filtering. Proceedings of the 2003 Spam Conference, January 2003.
[4] Gómez, J.M., Maña-López, M., Puertas, E. Combining Text and Heuristics for Cost-Sensitive Spam Filtering. Proceedings of the Fourth Computational Natural Language Learning Workshop, CoNLL-2000, Association for Computational Linguistics, 2000.
[5] Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, 1999.
[6] Drucker, H., Vapnik, V., Wu, D. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), pp. 1048-1054, 1999.
[7] Frank, E., Witten, I.H. Generating accurate rule sets without global optimization. Machine Learning: Proceedings of the Fifteenth International Conference, 1998.
[8] Gómez, J.M. Evaluating cost-sensitive unsolicited bulk email categorization. Proceedings of the ACM Symposium on Applied Computing, 2002.
[9] Joachims, T. A statistical learning model of text classification with support vector machines. In Proceedings of the 24th ACM International Conference on Research and Development in Information Retrieval. ACM Press, 2001.
[10] Lewis, D.D. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the 10th European Conference on Machine Learning. Springer Verlag, 1998.
[11] Provost, F., Fawcett, T. Robust classification for imprecise environments. Machine Learning Journal, 42(3):203-231, 2001.
[12] Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[13] Salton, G. Automatic text processing: the transformation, analysis and retrieval of information by computer. Addison-Wesley, 1989.
[14] Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1-47, 2002.
[15] Ting, K.M. Inducing cost-sensitive trees via instance weighting. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 139-147, 1998.
[16] Witten, I.H., Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[17] Xiang, Y., Chowdhury, M., Ali, S. Filtering Mobile Spam by Support Vector Machine. Proceedings of CSITeA-04, ISCA Press, December 27-29, 2004.
[18] Yang, Y. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1/2):69-90, 1999.
[19] Yang, Y., Pedersen, J.O. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, 1997.

VI. CONCLUSIONS AND SUMMARY OF CONTRIBUTIONS

The goal of this work has been to improve the effectiveness of Adversarial Text Categorization techniques. We have therefore approached the problem as a Text Categorization task in which the most critical and sensitive phases are those corresponding to Feature Engineering and to Evaluation.

The Feature Engineering phase is critical because it is where adversaries usually concentrate their attacks: they aim to degrade the effectiveness of the classifiers by attacking the indexing and representation phases of the documents in order to confuse the Machine Learning algorithm. In this work we propose that a correct use of these techniques, adapted to the problem at hand, improves classification results. We have applied these techniques to three domains: email spam filtering, the filtering of inappropriate Internet content, and SMS spam filtering, which has the peculiarity of dealing with very short text documents (of at most 160 characters). In these domains we have run experiments with different Text Classification algorithms to determine which were most effective.

On the other hand, the evaluation phase is critical because this kind of adversarial task usually involves asymmetric error costs; that is, the cost to the end user of a misclassification is not the same for all classes. In addition, these final costs are usually unknown a priori at the time the learning is performed.

For this reason we have proposed an evaluation methodology that is independent of the error costs and of the imbalance in the number of documents in each class. We were pioneers in the use of these evaluation metrics, which have moreover become a reference for the evaluation of Adversarial Classification tasks.


VIII. ANNEXES

Letters of agreement from the co-authors of the articles

CO-AUTHOR ACCEPTANCE FORM (DOCTORS AND NON-DOCTORS)

Mr. José Carlos Cortizo Pérez, with DNI or passport number 53137804F, born on 23 June 1980, residing at calle Buenavista, núm. 8, piso y puerta 3C, postal code 28660, Boadilla del Monte, Madrid, telephone 679950776, email address josecarlos.cortizo@brainsins.com,

DECLARES THAT

as a NON-DOCTOR CO-AUTHOR, I am informed that Mr. Enrique Puertas Sanz intends to request authorization from the Academic Committee of the Doctoral Programme of the Universidad Europea de Madrid to present his doctoral thesis in the form of a compendium of publications, and that, as co-author, I waive its presentation as part of any other doctoral thesis.

And, to this effect,

I STATE

That I accept the use of the following work(s) (indicating each and every work in which I appear as co-author):

• Email Spam Filtering

for the presentation of his doctoral thesis at the Universidad Europea de Madrid in the form of a compendium of publications.

Signature

In Boadilla del Monte, on 17 June 2013
