11 maja 2017 12:15
Seminarium ZZOiAA: Michał Gańczorz, Improvements on Re-Pair grammar compressor
Title: Improvements on Re-Pair grammar compressor
Authors: Michał Gańczorz and Artur Jeż
Speaker: Michał Gańczorz
Abstract:
In this paper we propose a heuristical grammar compressor, based on the well-known RePair algorithm. As Re-Pair, it recursively chooses a digram of letters to be replaced with a new symbol; while Re-Pair chooses the most often digram, we propose to also employ penalties, if the chosen digrams are not coherent with a Lempel-Ziv factorisation. The idea of such penalties is motivated by analysis of recent approximation algorithms for this problem. We experimentally evaluate the proposed heuristics on established corpora, obtaining smaller grammars than the one generated by Re-Pair for English (and similar) text data and WebGraph, unfortunately for other types of data no improvement was obtained.