May 11, 2017, 12:15 p.m.

Title: Improvements on Re-Pair grammar compressor

Authors: Michał Gańczorz and Artur Jeż

Speaker: Michał Gańczorz


In this paper we propose a heuristical grammar compressor, based on the well-known RePair algorithm. As Re-Pair, it recursively chooses a digram of letters to be replaced with a new symbol; while Re-Pair chooses the most often digram, we propose to also employ penalties, if the chosen digrams are not coherent with a Lempel-Ziv factorisation. The idea of such penalties is motivated by analysis of recent approximation algorithms for this problem. We experimentally evaluate the proposed heuristics on established corpora, obtaining smaller grammars than the one generated by Re-Pair for English (and similar) text data and WebGraph, unfortunately for other types of data no improvement was obtained.