Lab of Advanced Algorithms and Applications

The Lab of Advanced Algorithms and Applications (A³, read "a-cube") is mainly devoted to the design, analysis and experimentation of algorithms and data structures for storing, compressing, mining and retrieving information from large amounts of textual data like Web repositories, XML file collections, textual databases, genomic/DNA sequences.


We are developing a new semantic-annotation technology for short textual fragments, called TAGME [ACM CIKM 2010, IEEE Software 2012]. This tool has been applied succesfully to many contexts concerning with the clustering, the classification and the similarity-comparison of short texts.

We are also studying the design of of “Multi-objective” data compressor, a novel paradigm in which optimization techniques are plugged into the design of data compressors. For example, in the Bicriteria Data Compression ([ACM-SIAM SODA14], to appear), the compressor solves the following problem: "return a compressed file which minimizes the compressed space provided that the decompression time is less than T". Similarly, the compressor can "return a compressed file which minimizes the decompression time provided that the compressed space is less than S bits".

A third line of research is a classic one for our group lasting for 20 years now and regarding the design of compressed indexes for big data (such as String B-tree [JACM 1999] and FM-index [JACM 2005]). Recently we have designed cache-oblivious compressed version of those indexes for dictionaries of strings [ESA 2013], and distribution-ware compressed indexes for data collections [Algorithmica 2013].

Current Partnerships

  • Bassilichi
  • Google Zurich
  • Net7
  • SpazioDati
  • Tiscali Italia

Current Grants

  • [2013-2015] Regione Toscana, Net7, StudioFlu, SpazioDati
  • [2013-2014] Bassilichi
  • [2013-2014] Google Research Award
  • [2012-2014] Italian MIUR-PRIN Project "ARS-Technomedia"
  • [2011-2012] Telecom Italia Working Capital
  • [2010] Google Research Award
  • [2010-2012] Italian MIUR-PRIN Project "The Mad Web"
  • [2006-2011] Yahoo! Research
  • [2009-2012] Italian MIUR-FIRB Project "Linguistica"

Latest News

WikipediaSimilarity411 dataset released » January 23, 2014

Darth Vader is closely related to the Death Star, but totally unrelated to Homeopathy. A relatedness function is a function that, given two Wikipedia pages, returns their relatedness in a [0, 1] range. Many disambiguation techniques are based on a relatedness function.

But how would you measure a relatedness function? A possible way is, given a set of Wikipedia-pages pairs, to ask humans the relatedness between those pairs of pages, and check the output of the relatedness function against the relatedness found by humans.

Such a dataset is Milne&Witten’s WikipediaSimilarity353 dataset. Unfortunately, since Wikipedia is ever-growing, the dataset became obsolete, pointing to pages that don’t exist anymore. To address this issue, we cleaned it, updated a few references, and added a few pairs of pages.

We created a brand-new WikipediaSimilarity411 dataset anybody can use, which provides 411 Wikipedia-pages pairs. Please also note that the BAT-Framework has been updated and features the test of a relatedness function against this dataset.


“Bicriteria Data Compression” Accepted at SODA14! » December 1, 2013

We are pleased to announce that our paper “Bicriteria Data Compression” has been accepted at the ACM/SIAM Symposium On Discrete Algorithms (SODA) 14! The work will be illustrated on January 7, 4PM, Galleria North – Ballroom Level, Hilton Portland & Executive Tower.

For more information, please consult our page devoted to the subject or the preprint published on arxiv.

TAGME service reaches 100M queries! » November 12, 2013

We are pleased to announce that our TAGME API service has been hit more than 100 millions in about 2 years. We are currently providing access for more than 100 users and sometimes the service has been able to handle more than 1 million queries per day.

Thank you very much to all users that have provided their valuable feedback.

bc-zip dataset » October 3, 2013

Here we publish the dataset used in our “Bicriteria Data Compression” paper (arxiv version).

Each file is a chunk of 1GB (2^30 bytes) extracted from the following sources: