Word-Class Lattices (WCLs)

Word-Class Lattices (WCLs), developed by Roberto Navigli and Paola Velardi, are a generalization of word lattices for modeling textual definitions. Our classifiers, based on two variants of WCLs, identify definitions and extract hypernyms with high accuracy.

WCL Java API

We release here our implementation of Word-Class Lattices, available as a Java API download. The WCL classifier can easily be used programmatically in any Java project.

The code snippet below shows an example of API usage. After selecting the target language, we load the training dataset for that language, create an instance of WCLClassifier, and launch the training phase on the input training corpus. The classifier is then ready to be tested on any sentence in the target language: if it recognizes the sentence as a definition, we print the extracted hypernym. Running the code prints the string "classifier", which is the hypernym extracted by WCL for the input sentence "WCL is a classifier."
import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.data.sentence.Sentence;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import it.uniroma1.lcl.wcl.data.sentence.SentenceAnnotation;
import java.io.IOException;
 
public class Test
{
    public static void main(String[] args)
    {
        // select the language of interest
        Language targetLanguage = Language.EN;
        String trainingDatasetFile = "data/training/wiki_good.EN.html";
        Dataset ts;
               
        // open the training set
        try
        {
            // load the training set for the target language
            ts = new AnnotatedDataset(trainingDatasetFile, targetLanguage);
            // obtain an instance of the WCL classifier
            WCLClassifier c = new TripleLatticeClassifier(targetLanguage);
            c.train(ts);
            // create a sentence to be tested
            Sentence sentence = Sentence.createFromString("WCL",
                                "WCL is a classifier.",
                                targetLanguage);
            // test the sentence
            SentenceAnnotation sa = c.test(sentence);
            // print the hypernym
            if (sa.isDefinition()) System.out.println(sa.getHyper());
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

Datasets

Manually annotated English training dataset: These manually annotated WCL datasets are described in [2], together with a linguistic analysis, and were used in [1] for the experimental evaluation of WCLs. The release package contains two folders: wikipedia and ukwac. The wikipedia folder contains the positive (wiki_good.txt) and negative (wiki_bad.txt) definition candidates extracted from Wikipedia. The ukwac folder contains candidate definitions for over 300,000 sentences from the ukWaC Web corpus (ukwac_testset.txt), each containing at least one of 239 domain terms selected from the terminology of four different domains (ukwac_terms.txt). To estimate recall, we manually checked 50,000 of these sentences and identified 99 definitional sentences (ukwac_estimated_recall.txt).
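
As a rough illustration of how the released files could be combined with the API above, the sketch below trains a classifier as in the earlier example and then runs it over candidate sentences read from a plain-text file. The training file path is taken from the example above; the assumption that ukwac_testset.txt contains one candidate sentence per line, the relative file paths, and the use of a placeholder term are ours and may need adapting to the actual package layout.

import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.data.sentence.Sentence;
import it.uniroma1.lcl.wcl.data.sentence.SentenceAnnotation;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
 
public class UkwacSweep
{
    public static void main(String[] args) throws IOException
    {
        Language lang = Language.EN;
        // train on the English training file, as in the API example above
        Dataset training = new AnnotatedDataset("data/training/wiki_good.EN.html", lang);
        WCLClassifier classifier = new TripleLatticeClassifier(lang);
        classifier.train(training);
 
        int definitions = 0;
        // ASSUMPTION: ukwac_testset.txt contains one candidate sentence per line
        try (BufferedReader in = new BufferedReader(new FileReader("ukwac/ukwac_testset.txt")))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                if (line.isEmpty()) continue;
                // ASSUMPTION: a placeholder term is used as the first argument
                Sentence candidate = Sentence.createFromString("term", line, lang);
                SentenceAnnotation annotation = classifier.test(candidate);
                if (annotation.isDefinition())
                {
                    definitions++;
                    System.out.println(annotation.getHyper() + "\t" + line);
                }
            }
        }
        System.out.println("Sentences classified as definitional: " + definitions);
    }
}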

Automatically annotated training datasets from Wikipedia: These automatically annotated training datasets were obtained from Wikipedia for three languages: English, French and Italian. The procedure used to create them is described in [3].
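
Since the release covers English, French and Italian, a natural use of these datasets is to train one classifier per language with the same API shown above. The sketch below assumes that the Language enumeration exposes FR and IT constants alongside EN and that the per-language training files follow the same naming scheme as the English one; both are assumptions on our part, not documented behaviour.

import it.uniroma1.lcl.jlt.util.Language;
import it.uniroma1.lcl.wcl.data.dataset.AnnotatedDataset;
import it.uniroma1.lcl.wcl.data.dataset.Dataset;
import it.uniroma1.lcl.wcl.classifiers.lattice.TripleLatticeClassifier;
import it.uniroma1.lcl.wcl.classifiers.lattice.WCLClassifier;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
 
public class MultilingualTraining
{
    public static void main(String[] args) throws IOException
    {
        // ASSUMPTION: Language.FR and Language.IT exist, and the training files
        // are named analogously to the English file used in the example above
        Language[] languages = { Language.EN, Language.FR, Language.IT };
        Map<Language, WCLClassifier> classifiers = new HashMap<Language, WCLClassifier>();
        for (Language lang : languages)
        {
            String trainingFile = "data/training/wiki_good." + lang + ".html";
            Dataset ts = new AnnotatedDataset(trainingFile, lang);
            WCLClassifier c = new TripleLatticeClassifier(lang);
            c.train(ts);
            classifiers.put(lang, c);
        }
        // each classifier can now be queried with sentences in its language
    }
}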

References

[1] When citing the Word-Class Lattice algorithm and our experimental results, please refer to the following paper:

Roberto Navigli, Paola Velardi. Learning Word-Class Lattices for Definition and Hypernym Extraction. Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 11-16, 2010, pp. 1318-1327.

[2] When referring to the manually-created dataset only, please cite the following paper:

Roberto Navigli, Paola Velardi, Juana María Ruiz-Martínez. An Annotated Dataset for Extracting Definitions and Hypernyms from the Web. Proc. of LREC 2010, Valletta, Malta, May 19-21, 2010, pp. 3716-3722.

[3] When referring to the automatically-created dataset and/or the WCL API, please cite the following paper:

Stefano Faralli, Roberto Navigli. A Java Framework for Multilingual Definition and Hypernym Extraction. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2013), Sofia, Bulgaria, August 4-9, 2013, pp. 103-108.


Last update: 29 July 2013 by Stefano Faralli