Lemmatization Table based on OpenCorpora Morphological Dictionary

A lemmatization table generated from the filtered OpenCorpora morphological dictionary. The table is useful for simple, fast "word form-to-lemma" replacement, for example with lemmatize_strings from the textstem package. It contains 3049772 word forms of 376782 lemmas.

Usage

data(hash_lemmas_opencorpora)

Format

A data table with 3049772 rows and 2 variables:

token: an inflected word form (textual token)
lemma: a base form or lemma
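
As a minimal sketch of how the two columns can serve as a lookup, the example below uses a small toy table in place of hash_lemmas_opencorpora. The rows are illustrative only, and lemmatize_tokens is a hypothetical helper written for this example, not part of any package.

```r
# Toy stand-in for hash_lemmas_opencorpora: same two columns,
# but the rows are illustrative and not taken from the dictionary.
lemmas <- data.frame(
  token = c("кошки", "кошку", "столы"),
  lemma = c("кошка", "кошка", "стол"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each known word form with its lemma
# and leave words that are missing from the table unchanged.
lemmatize_tokens <- function(words, lemma_table) {
  idx <- match(words, lemma_table$token)
  ifelse(is.na(idx), words, lemma_table$lemma[idx])
}

result <- lemmatize_tokens(c("кошку", "столы", "дом"), lemmas)
print(result)
```

With the textstem package installed, the full table would presumably be passed as the dictionary argument of lemmatize_strings instead of using a hand-written helper.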

Source

http://opencorpora.org/files/export/dict/dict.opcorpora.txt.zip

Details

The lemmatization table was generated from the original OpenCorpora morphological dictionary in two filtering steps: first only unique token-lemma pairs were kept, then only unique tokens. In both steps the first occurrence was retained, so each token maps to exactly one lemma.
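
The filtering described above can be sketched in base R as follows. This is an illustration on toy rows under the assumption that the raw data is a two-column table of token-lemma pairs; it is not the script that produced the dataset.

```r
# Toy raw token-lemma pairs; the rows are illustrative only.
raw <- data.frame(
  token = c("стали", "стали", "стали", "дома"),
  lemma = c("стать", "стать", "сталь", "дом"),
  stringsAsFactors = FALSE
)

# Step 1: keep unique token-lemma pairs (first occurrence).
pairs <- raw[!duplicated(raw[, c("token", "lemma")]), ]

# Step 2: keep unique tokens (first occurrence), so that every
# token is mapped to a single lemma.
lemma_table <- pairs[!duplicated(pairs$token), ]

print(lemma_table)
```

One consequence of the second step is that an ambiguous word form keeps only the lemma listed first, as with "стали" in the toy rows.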

License

The base morphological dictionary of OpenCorpora is published under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

References

OpenCorpora project web-page: http://opencorpora.org