NASARI provides semantic vector representations for BabelNet synsets* and Wikipedia pages in several languages. Three vector types are currently available: lexical, unified and embedded. NASARI offers large coverage of concepts and named entities and has proved useful for many Natural Language Processing tasks, such as multilingual semantic similarity, sense clustering and word sense disambiguation, on which it has contributed to state-of-the-art results on standard benchmarks.

*Please note that BabelNet covers WordNet and Wikipedia among other resources, so our vectors can also be used to represent concepts and named entities from each of these resources.


Downloads

NASARI is currently available for English, Spanish, French, German and Italian. Please find more information in the README file. Stay tuned for the release of NASARI representations in other languages! The NASARI-embed vectors below share the same vector space as the pre-trained Word2Vec vectors for English (trained on the Google News corpus) and as the Word2Vec word embeddings trained on the Spanish Billion Words Corpus for Spanish (more information in the main reference paper). You can download the Spanish word embeddings here.

New (July 2017): You can now additionally download the 300-dimensional NASARI-embed concept and entity BabelNet synset embeddings together with the Word2Vec word embeddings trained on the UMBC corpus, both in the same vector space. These vectors tend to show superior performance to the NASARI-embed vectors trained on Google News below. Download both NASARI-embed and the UMBC word embeddings here (note that in this version all word embeddings are lowercased).
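As a usage sketch (not part of the release), the shared space can be queried by loading both files and comparing a word vector directly against a synset vector. The snippet below assumes both files are in the standard Word2Vec text format and uses the gensim library; the file names and the synset key are placeholders, not the actual names shipped in the download.

# Minimal sketch: word-synset similarity in the shared NASARI/UMBC space.
# File names and the synset key below are placeholders (see the README
# of the actual download for the real names and key format).
import numpy as np
from gensim.models import KeyedVectors

# Load the UMBC word embeddings and the NASARI-embed synset vectors
# (both assumed to be in Word2Vec text format).
words = KeyedVectors.load_word2vec_format('umbc_word_vectors.txt', binary=False)
synsets = KeyedVectors.load_word2vec_format('nasari_embed_umbc.txt', binary=False)

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Because both sets of vectors live in the same 300-dimensional space,
# a word can be compared directly with a concept/entity vector.
word_vec = words['apple']             # word embeddings are lowercased in this version
synset_vec = synsets['bn:00005054n']  # hypothetical key for a BabelNet synset
print(cosine(word_vec, synset_vec))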

Note: the first three rows of the table below correspond to the NASARI vector representations for all English Wikipedia pages (Wikipedia dump of November 2014). In the remaining files each vector is tagged with its corresponding BabelNet synset and Wikipedia page.

Language Type # of BabelNet synsets # of Wikipedia pages Download size
English Lexical(Wiki) - 4.40M 4.7GB
English Embed(Wiki) - 4.40M 5.9GB
English Unified(Wiki) - 2.85M 341MB
English Lexical 4.42M 4.40M 4.7GB
English Embed 4.42M 4.40M 5.9GB
English Unified 2.87M 2.85M 352MB
Spanish Lexical 1.07M 1.05M 705MB
Spanish Unified 657K 635K 60MB
Spanish Embed 1.07M 1.05M 1.4GB
French Lexical 1.48M 1.45M 1.1GB
French Unified 882K 861K 96MB
German Lexical 1.51M 1.49M 1.4GB
German Unified 857K 836K 59MB
Italian Lexical 1.10M 1.08M 843MB
Italian Unified 631K 610K 69MB

The NASARI-embed vector representations can also be downloaded in binary format: [bin:4.8GB] (compatible with Word2Vec). NASARI lexical vectors in English can also be downloaded in tar.bz2 compressed format: [tar.bz2:3.6GB].
Please note that you can use the BabelNet API to get the most out of these vectors, e.g., to access the corresponding WordNet synsets or lexicalizations.
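Since the binary file is Word2Vec-compatible, it can be read with any Word2Vec loader. Below is a minimal sketch using gensim; the file name and the query key are placeholders rather than the actual identifiers shipped in the release.

# Minimal sketch: loading the NASARI-embed binary file (Word2Vec-compatible).
# The file name and the query key are placeholders; replace them with the
# downloaded file and a key that actually appears in it.
from gensim.models import KeyedVectors

nasari = KeyedVectors.load_word2vec_format('NASARI_embed_english.bin', binary=True)

# Nearest neighbours of a vector key (the key naming scheme is described
# in the README; the one below is only a hypothetical example).
for key, score in nasari.most_similar('bn:00005054n', topn=5):
    print(key, round(score, 3))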

Release history

Current version: 3.0

Release version Date Features Reference paper
1.0 April 2015 English lexical and unified vectors for WordNet synsets and Wikipedia pages. NAACL 2015
2.0 August 2015 Multilingual extension through BabelNet. Available in English, Spanish, French, German and Italian. ACL 2015
2.1 October 2015 Minor bug fixes and updated format. -
3.0 March 2016 Improved lexical and unified vectors. Integration of embedding vector representations. AIJ 2016

Main reference

If you use any of the resources available in this website, please refer to the following article [bib]:

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities.
Artificial Intelligence, 240, Elsevier, 2016, pp. 36-64.

@article{camacho2016nasari,
  title={Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities},
  author={Camacho-Collados, Jos{\'e} and Pilehvar, Mohammad Taher and Navigli, Roberto},
  journal={Artificial Intelligence},
  volume={240},
  pages={36--64},
  year={2016},
  publisher={Elsevier}
}

Previous reference papers

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
NASARI: a Novel Approach to a Semantically-Aware Representation of Items.
In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2015), Denver, USA, pp. 567-577, 2015.

José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli.
A Unified Multilingual Semantic Representation of Concepts.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), Beijing, China, July 27-29, pp. 741-751, 2015.

Contact

Should you have any enquiries about any of the resources, please contact us:


NASARI is an output of the MultiJEDI ERC Starting Grant No. 259234. NASARI is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.