Lemmatization Table based on OpenCorpora Morphological Dictionary
Source: R/hash_lemmas_opencorpora.R
hash_lemmas_opencorpora.Rd
A lemmatization table generated from the filtered OpenCorpora morphological dictionary. The table is useful for simple and fast "word form-to-lemma" replacement, for example with lemmatize_strings from the textstem package.
The table contains 3049772 word forms of 376782 lemmas.
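A minimal sketch of "word form-to-lemma" replacement with a two-column table of this shape. The three-row `toy_lemmas` table and the helper `lemmatize_tokens` are hypothetical stand-ins introduced only for illustration; they are not part of the package, and the real table is far larger.

```r
# Hypothetical stand-in with the same two columns as hash_lemmas_opencorpora.
toy_lemmas <- data.frame(
  token = c("кошки", "кошкой", "собаками"),
  lemma = c("кошка", "кошка", "собака"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not a package function): replace each token by its
# lemma when the token is found in the table, otherwise keep it unchanged.
lemmatize_tokens <- function(tokens, dictionary) {
  idx <- match(tokens, dictionary$token)  # position of the first match, NA if absent
  ifelse(is.na(idx), tokens, dictionary$lemma[idx])
}

words <- c("кошки", "собаками", "дом")
lemmas <- lemmatize_tokens(words, toy_lemmas)
print(lemmas)
```

With textstem installed, a table of this shape can presumably be passed as the `dictionary` argument of `lemmatize_strings` instead of using a hand-written lookup.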
Usage
data(hash_lemmas_opencorpora)
Format
A data table with 3049772 rows and 2 variables:
- token
a textual token (word) inflected by affixes
- lemma
a base form or lemma
Details
The lemmatization table was generated from the original OpenCorpora morphological dictionary by keeping only unique token-lemma pairs and unique tokens (the first occurrence is kept in both cases).
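The filtering described above could look roughly like the following sketch. The generating script itself is not shown here, so the order of the two steps, the `raw_pairs` data, and all variable names are assumptions made for illustration.

```r
# Hypothetical raw token-lemma pairs, including an exact duplicate pair
# and an ambiguous token ("стали") that maps to two different lemmas.
raw_pairs <- data.frame(
  token = c("стали", "стали", "стали", "кошки", "кошки"),
  lemma = c("стать", "стать", "сталь", "кошка", "кошка"),
  stringsAsFactors = FALSE
)

# Keep unique token-lemma pairs (first occurrence wins).
unique_pairs <- raw_pairs[!duplicated(raw_pairs[, c("token", "lemma")]), ]

# Keep unique tokens (first occurrence wins), so every token has one lemma.
lemma_table <- unique_pairs[!duplicated(unique_pairs$token), ]
rownames(lemma_table) <- NULL

print(lemma_table)
```

One consequence of keeping only the first occurrence of each token is that an ambiguous word form is always mapped to a single lemma, regardless of context.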
License
The base morphological dictionary of OpenCorpora is published under the Creative Commons "Attribution-ShareAlike" 3.0 Unported License (CC BY-SA 3.0).
References
OpenCorpora project web-page: http://opencorpora.org