Fast compressed-based strategies for author profiling of social media texts

CERI 2016: 14
Fast compressed-based strategies for author profiling of social media texts
Francisco Claude, Roberto Konow, Susana Ladra
Categories
eBay Authors
Abstract

Given a text, it may be useful to determine the age, gender, native language, nationality, personality and other demographic attributes of its author. This task is called author profiling, and has been studied by different areas, especially from linguistics and natural language processing, by extracting different content- and style-based features from training documents and then using various machine learning approaches.

In this paper we address the author profiling task by using several compression-inspired strategies. More specifically, we generate different models to identify the age and the gender of the author of a given document without analysing or extracting specific features from the textual content, making them style-oblivious approaches.

We compare and analyse their behaviour over datasets of different nature. Our results show that by using simple compression-inspired techniques we are able to obtain very competitive results in terms of accuracy and we are orders of magnitude faster for the evaluation phase when compared to other state-of-the-art complex and resource-demanding techniques.

Another publication from the same author: Roberto Konow

Information Systems 60: 34-49 (2016)

Aggregated 2D range queries on clustered points.

Nieves R. Brisaboa, Guillermo de Bernardo, Roberto Konow, Gonzalo Navarro, Diego Seco

Efficient processing of aggregated range queries on two-dimensional grids is a common requirement in information retrieval and data mining systems, for example in Geographic Information Systems and OLAP cubes. We introduce a technique to represent grids supporting aggregated range queries that requires little space when the data points in the grid are clustered, which is common in practice. We show how this general technique can be used to support two important types of aggregated queries, which are ranked range queries and counting range queries. Our experimental evaluation shows that this technique can speed up aggregated queries up to more than an order of magnitude, with a small space overhead.

Keywords
Categories

Another publication from the same category: Other

HotCloud '15, 7th USENIX Workshop on Hot Topics in Cloud Computing, Santa Clara July 2015

The Importance of Features for Statistical Anomaly Detection

David Goldberg, Yinan Shan

The theme of this paper is that anomaly detection splits into two parts: developing the right features, and then feeding these features into a statistical system that detects anomalies in the features. Most literature on anomaly detection focuses on the second part. Our goal is to illustrate the importance of the first part. We do this with two real-life examples of anomaly detectors in use at eBay.

Keywords
Categories