Features
Please refer to the docstrings in the source code for up-to-date descriptions of each feature. Each docstring gives the feature's definition, technical details, and known differences from Writeprints-Static (Brennan, Afroz, and Greenstadt, 2012). An example docstring is shown below.
"""avg_word_length
Counts the average number of characters for words in the text.
Computed as the length of the concatenation of all words divided by "total words".
Known differences with Writeprints Static feature "average word length": None.
Args:
word_tokens: List of lists of token.text in spaCy doc instances.
Returns:
Average length of words in the document.
"""
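To make the definition above concrete, here is a minimal sketch of how such a feature could be computed from the docstring alone. It is not the package's actual implementation: the function body, the type hints, and the choice to return 0.0 for empty input are assumptions made for illustration.

```python
from typing import List


def avg_word_length(word_tokens: List[List[str]]) -> float:
    """Average number of characters per word across all token lists."""
    # Flatten the list of lists into a single list of word strings.
    words = [word for tokens in word_tokens for word in tokens]
    if not words:
        # Assumption: return 0.0 rather than raise on empty input.
        return 0.0
    # Length of the concatenation of all words over the total word count.
    return len("".join(words)) / len(words)


if __name__ == "__main__":
    sample = [["The", "quick", "fox"], ["jumps"]]
    print(avg_word_length(sample))
```

The input here is plain strings, so the sketch runs without spaCy; in the package the strings would come from token.text in spaCy doc instances, as the docstring states.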
Caveat
The writeprints-static package uses spaCy 2.x's default tokenizer under the hood.
Tokenizers differ in how they define a word token, and these differences carry over to every feature whose calculation depends on word tokens.
For instance, NLTK 3.5's default tokenizer (word_tokenize) disagrees with spaCy's in some respects.
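The effect of the tokenizer on word-based features can be illustrated with a small standard-library sketch. The two toy tokenizers below are hypothetical stand-ins chosen for illustration; neither reproduces the rules of spaCy or NLTK. They only show that the same text can yield different token counts and therefore different average word lengths.

```python
import re
from typing import List

TEXT = "Don't panic: it's a well-known issue."


def whitespace_tokens(text: str) -> List[str]:
    # Toy rule 1: split on whitespace, keeping punctuation attached.
    return text.split()


def letter_run_tokens(text: str) -> List[str]:
    # Toy rule 2: keep only runs of ASCII letters, so apostrophes,
    # hyphens, and punctuation all act as separators.
    return re.findall(r"[A-Za-z]+", text)


def avg_len(tokens: List[str]) -> float:
    # Average number of characters per token.
    return sum(len(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    for tokenize in (whitespace_tokens, letter_run_tokens):
        tokens = tokenize(TEXT)
        print(tokenize.__name__, len(tokens), round(avg_len(tokens), 2))
```

When comparing feature values with another implementation, it is worth checking which tokenizer produced the word tokens before attributing a difference to the feature definition itself.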