Definition extraction from legislation

Publishers used to sell just texts. With the arrival of the internet, they put some of these texts online. That worked for a while. But in these days of the Google Knowledge Graph, Wolfram Alpha and IBM Watson, selling texts is not sufficient anymore, whatever the channel. What publishers need to sell is knowledge.

This argument holds true in particular for texts like legislation, which are often freely available online. As a publisher, you need to add more value to these texts before you can package and sell them. Lately I’ve therefore been experimenting with several types of content enrichment. One example is the automatic extraction of definitions.

Legislation is riddled with definitions. Rules must be clear, and therefore terms must be defined rigorously. Luckily, most definitions are easy to spot. They tend to be located in the first few articles of a law, often display clear lexical patterns like “wordt verstaan onder” (is understood as) in Dutch, and typically mark the defined term with quotation marks. This makes them a rewarding target for automatic extraction.

Last week I put together a simple rule-based definition extraction engine for Dutch. At its core are a few basic pre-processing steps that I plugged in from NLTK: the input text is split into individual sentences, the sentences are tokenized (divided into words), the tokens are tagged (labelled with their parts of speech, like “noun” or “verb”), and the tagged tokens are grouped into chunks (higher-level syntactic units like NPs, noun phrases, and VPs, verb phrases). This information is the input for the actual extraction process.
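To make those four steps concrete, here is a minimal sketch in Python. It is not the engine described above: instead of NLTK's trained models it uses toy stand-ins (a regex sentence splitter, a regex tokenizer, a tiny hand-written lexicon called TOY_LEXICON, and a greedy noun-phrase chunker). All names and tag labels are assumptions made for illustration only.

```python
import re

# Hypothetical toy lexicon; a real tagger would be trained on a Dutch corpus.
TOY_LEXICON = {
    "een": "DET", "de": "DET", "het": "DET",
    "wasem": "N", "laag": "N", "condens": "N", "voorruit": "N",
    "is": "V", "wordt": "V",
}


def split_sentences(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens):
    """Look each token up in the toy lexicon; unknown words get the tag 'X'."""
    tagged = []
    for tok in tokens:
        if not tok[0].isalnum():
            tagged.append((tok, "PUNCT"))
        else:
            tagged.append((tok, TOY_LEXICON.get(tok.lower(), "X")))
    return tagged


def chunk(tagged):
    """Group an optional determiner plus one or more nouns into an NP chunk."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i + 1 if tagged[i][1] == "DET" else i
        k = j
        while k < len(tagged) and tagged[k][1] == "N":
            k += 1
        if k > j:  # at least one noun was found
            chunks.append(("NP", [tok for tok, _ in tagged[i:k]]))
            i = k
        else:  # no noun here: keep the token as a one-word chunk
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


if __name__ == "__main__":
    for sentence in split_sentences("Wasem is een laag condens. De voorruit is nat."):
        print(chunk(tag(tokenize(sentence))))
```

The point of the sketch is the shape of the data: each step consumes the output of the previous one, and the extraction rules only ever see the chunked result.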

The extraction process couldn’t be simpler. So far I’ve experimented with a limited set of extraction rules: the engine looks for patterns like noun phrases with quotation marks and sentence structures like “NP is NP that” or “NP is understood as”. Still, when I test this limited rule set on a fairly large selection of Flemish, Belgian and European legislation in Dutch, it throws up almost 30,000 definitions!
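As an illustration of what such rules can look like, here is a small sketch. It is a simplification under stated assumptions: it matches regular expressions against raw sentence strings rather than against chunked noun phrases, and the two patterns in RULES and the function name extract_definition are hypothetical, not the rules of the engine described here.

```python
import re

# A term between curly or straight double quotation marks.
QUOTED = r'[“"]([^”"]+)[”"]'

# Hypothetical rules, tried in order; the first one that matches wins.
RULES = [
    # onder "term" wordt verstaan: definition
    re.compile(r'\bonder\s+' + QUOTED + r'\s+wordt\s+verstaan\s*:?\s*(.+)',
               re.IGNORECASE),
    # "term": definition
    re.compile(QUOTED + r'\s*:\s*(.+)'),
]


def extract_definition(sentence):
    """Return (term, definition) for the first matching rule, else None."""
    for rule in RULES:
        match = rule.search(sentence)
        if match:
            term = match.group(1).strip()
            # Drop the punctuation that closes a list item or sentence.
            definition = match.group(2).strip().rstrip(".;")
            return term, definition
    return None


if __name__ == "__main__":
    print(extract_definition(
        "“wasem”: een laag condens op het binnenoppervlak van de voorruit;"))
```

Rules this loose will also fire on quoted strings that are not definitions, which is one reason the output needs checking.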

Although I had expected to see a considerable number of definitions, I was surprised by their quantity and their variety. They range from everyday concepts, like

“wasem”: een laag condens op het binnenoppervlak van de voorruit
(“vapor”: a layer of condensate on the inner surface of the windscreen)

to fairly exotic ones, like

“geurbordspel”: speelgoed met behulp waarvan een kind verschillende parfums of geuren leert te herkennen
(“olfactory board game”: a toy through which a child learns to recognize different perfumes or scents)

and from typically legislative concepts, like

“verschuldigd bedrag”: de hoofdsom die binnen de contractuele of wettelijke betalingstermijn had moeten worden voldaan, inclusief toepasselijke belastingen, rechten, heffingen of kosten als vermeld in de factuur of in een gelijkwaardig verzoek tot betaling
(“amount due”: the principal sum that should have been paid within the contractual or statutory period of payment, including any applicable taxes, duties, levies or charges specified in the invoice or an equivalent request for payment)

to very scientific ones, like

“meter”: de meter is de lengte van de weg die het licht in vacuüm aflegt in een tijd van 1/299.792.458 seconde
(“meter”: the meter is the length of the path traveled by light in a vacuum in 1/299,792,458 of a second)

Some terms are unequivocal, but others have different definitions in different pieces of legislation, which poses an interesting problem for their disambiguation in other texts. Most results are correct definitions, but some are plain wrong, as it’s difficult to create rules that correctly identify definitions 100% of the time. Some manual intervention would be helpful to filter the results.
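One way to surface the terms that need disambiguation is to group the extracted definitions by term and flag every term with more than one distinct definition. The sketch below assumes the extraction output is available as (term, definition, source) tuples. That record layout, the function name find_ambiguous_terms and the sample records in the usage example are all invented for illustration.

```python
from collections import defaultdict


def find_ambiguous_terms(records):
    """Map each term with more than one distinct definition to its definitions.

    records: iterable of (term, definition, source) tuples (assumed layout).
    Returns {term: {normalized_definition: [sources]}}.
    """
    by_term = defaultdict(dict)
    for term, definition, source in records:
        # Normalize case and whitespace so trivial variants collapse.
        key = " ".join(definition.lower().split())
        by_term[term.lower()].setdefault(key, []).append(source)
    return {term: defs for term, defs in by_term.items() if len(defs) > 1}


if __name__ == "__main__":
    # Invented sample records, not real extraction results.
    sample = [
        ("meter", "de lengte van de weg die het licht aflegt", "Wet A"),
        ("meter", "een toestel dat het verbruik meet", "Wet B"),
        ("wasem", "een laag condens", "Wet C"),
    ]
    for term, defs in find_ambiguous_terms(sample).items():
        print(term, list(defs))
```

A list like this could also serve as the queue for manual review, since conflicting definitions are a natural place to start checking.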

It’s clear the simple extraction engine described here is only a first step. It’s a step towards more advanced extraction processes (rule-based and statistical), but mainly it’s a step towards more intelligent products that offer people not just texts, but knowledge. Publishers should anticipate people’s questions, and offer them the answers they are looking for. In a world where more and more content is freely available, I believe that’s the only way they can stay relevant.
