Word-Class Lattices (WCLs)

Word-Class Lattices (WCLs), developed by Roberto Navigli and Paola Velardi, are a generalization of word lattices for modeling textual definitions. Our classifiers, based on two variants of WCLs, identify definitions and extract hypernyms with high accuracy.

WCL Java API

We release here our implementation of Word-Class Lattices, available as a Java API download. The WCL classifier can easily be used programmatically in any Java project.

The code snippet below shows an example of API usage. After selecting the target language, we load the training dataset for that language, create an instance of WCLClassifier, and launch the training phase on the input training corpus. The classifier is then ready to be tested on any sentence in the target language: if it recognizes the sentence as a definition, we print the extracted hypernym. Running the code prints the string "classifier", which is the hypernym extracted by WCL for the input sentence "WCL is a classifier."
import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.data.sentence.Sentence;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import it.uniroma1.lcl.wcl.data.sentence.SentenceAnnotation;
import java.io.IOException;
 
public class Test
{
    public static void main(String[] args)
    {
        // select the language of interest
        Language targetLanguage = Language.EN;
        String trainingDatasetFile = "data/training/wiki_good.EN.html";
        Dataset ts;
               
        // open the training set
        try
        {
            // load the training set for the target language
            ts = new AnnotatedDataset(trainingDatasetFile, targetLanguage);
            // obtain an instance of the WCL classifier
            WCLClassifier c = new TripleLatticeClassifier(targetLanguage);
            c.train(ts);
            // create a sentence to be tested
            Sentence sentence = Sentence.createFromString("WCL",
                                "WCL is a classifier.",
                                targetLanguage);
            // test the sentence
            SentenceAnnotation sa = c.test(sentence);
            // print the hypernym
            if (sa.isDefinition()) System.out.println(sa.getHyper());
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

Datasets

Manually annotated English training dataset: These manually annotated WCL datasets are described in [2], together with a linguistic analysis, and were used in [1] for the experimental evaluation of WCLs. The release package contains two folders: wikipedia and ukwac. The wikipedia folder contains the positive (wiki_good.txt) and negative (wiki_bad.txt) definition candidates extracted from Wikipedia. The ukwac folder contains candidate definitions for over 300,000 sentences from the ukWaC Web corpus (ukwac_testset.txt), each containing at least one of 239 domain terms selected from the terminology of four different domains (ukwac_terms.txt). To estimate recall, we manually checked 50,000 of these sentences and identified 99 definitional sentences (ukwac_estimated_recall.txt).
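
As a rough illustration of how the released files could be combined with the API above, the sketch below trains a classifier as in the earlier example and then runs it over candidate sentences read from a plain-text file. The training file path is taken from the example above; the assumption that ukwac_testset.txt contains one candidate sentence per line, the relative file paths, and the use of a placeholder term are ours and may need adapting to the actual package layout.

import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.data.sentence.Sentence;
import it.uniroma1.lcl.wcl.data.sentence.SentenceAnnotation;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
 
public class UkwacSweep
{
    public static void main(String[] args) throws IOException
    {
        Language lang = Language.EN;
        // train on the English training file, as in the API example above
        Dataset training = new AnnotatedDataset("data/training/wiki_good.EN.html", lang);
        WCLClassifier classifier = new TripleLatticeClassifier(lang);
        classifier.train(training);
 
        int definitions = 0;
        // ASSUMPTION: ukwac_testset.txt contains one candidate sentence per line
        try (BufferedReader in = new BufferedReader(new FileReader("ukwac/ukwac_testset.txt")))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                if (line.isEmpty()) continue;
                // ASSUMPTION: a placeholder term is used as the first argument
                Sentence candidate = Sentence.createFromString("term", line, lang);
                SentenceAnnotation annotation = classifier.test(candidate);
                if (annotation.isDefinition())
                {
                    definitions++;
                    System.out.println(annotation.getHyper() + "\t" + line);
                }
            }
        }
        System.out.println("Sentences classified as definitional: " + definitions);
    }
}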

Automatically annotated training datasets from Wikipedia: These automatically annotated training datasets were obtained from Wikipedia for three languages: English, French and Italian. The procedure used to create them is described in [3].
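
Since the release covers English, French and Italian, a natural use of these datasets is to train one classifier per language with the same API shown above. The sketch below assumes that the Language enumeration exposes FR and IT constants alongside EN and that the per-language training files follow the same naming scheme as the English one; both are assumptions on our part, not documented behaviour.

import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
 
public class MultilingualTraining
{
    public static void main(String[] args) throws IOException
    {
        // ASSUMPTION: Language.FR and Language.IT exist, and the training files
        // are named analogously to the English file used in the example above
        Language[] languages = { Language.EN, Language.FR, Language.IT };
        Map<Language, WCLClassifier> classifiers = new HashMap<Language, WCLClassifier>();
        for (Language lang : languages)
        {
            String trainingFile = "data/training/wiki_good." + lang + ".html";
            Dataset ts = new AnnotatedDataset(trainingFile, lang);
            WCLClassifier c = new TripleLatticeClassifier(lang);
            c.train(ts);
            classifiers.put(lang, c);
        }
        // each classifier can now be queried with sentences in its language
    }
}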

References

[1] When citing the Word-Class Lattice algorithm and our experimental results, please refer to the following paper:

Roberto Navigli, Paola Velardi. Learning Word-Class Lattices for Definition and Hypernym Extraction. Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 11-16, 2010, pp. 1318-1327.

[2] When referring to the manually-created dataset only, please cite the following paper:

Roberto Navigli, Paola Velardi, Juana María Ruiz-Martínez. An Annotated Dataset for Extracting Definitions and Hypernyms from the Web. Proc. of LREC 2010, Valletta, Malta, May 19-21, 2010, pp. 3716-3722.

[3] When referring to the automatically-created dataset and/or the WCL API, please cite the following paper:

Stefano Faralli, Roberto Navigli. A Java Framework for Multilingual Definition and Hypernym Extraction. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 103-108.


Last update: 29 July 2013 by Stefano Faralli