writeprints-static
Extract lexical and syntactic features for authorship attribution research.
The writeprints-static package aims to reproduce the "Writeprints-Static" feature set described in Brennan et al. (2012). The API imitates that of scikit-learn's text feature extraction classes (e.g., CountVectorizer). The Writeprints-Static feature set contains the lexical and syntactic subset of the original Writeprints feature set first proposed by Abbasi and Chen (2008).
Group | Category | No. of Features | Description |
---|---|---|---|
Lexical | Word Level | 3 | Total words, average word length, number of short words |
Lexical | Character Level | 3 | Total characters, percentage of digits, percentage of uppercase letters |
Lexical | Special Character | 21 | Occurrence of special characters |
Lexical | Letters | 26 | Letter frequency |
Lexical | Digits | 10 | Digit frequency |
Lexical | Character Bigram | 39 | Percentage of common bigrams |
Lexical | Character Trigram | 20 | Percentage of common trigrams |
Lexical | Vocabulary Richness | 2 | Ratio of hapax legomena and dis legomena |
Syntactic | Function Words | 403 | Frequency of function words |
Syntactic | POS Tags | 22 | Frequency of part-of-speech tags |
Syntactic | Punctuation | 8 | Frequency and percentage of colon, semicolon, question mark, period, exclamation mark, comma |
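To make the Lexical rows of the table more concrete, here is a minimal pure-Python sketch of a few of them (word level, character level, and vocabulary richness). It is not the package's implementation: the function name `lexical_features`, the regular-expression tokenizer, the short-word threshold of 3 characters, and the choice of the number of distinct words as the denominator for the hapax and dis legomena ratios are all assumptions made for illustration. The package's docstrings remain the reference for the actual definitions.

```python
import re
from collections import Counter


def lexical_features(text, short_word_max_len=3):
    """Compute a few illustrative lexical features for one document.

    Hypothetical helper, not part of writeprints-static.
    """
    # Assumed tokenizer: runs of letters, digits, and apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_chars = len(text)

    # Case-insensitive word counts, then how many words occur once, twice, ...
    counts = Counter(w.lower() for w in words)
    freq_of_freq = Counter(counts.values())
    n_types = len(counts)

    return {
        "total_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Assumed definition: a short word has at most `short_word_max_len` characters.
        "short_words": sum(1 for w in words if len(w) <= short_word_max_len),
        "total_chars": n_chars,
        "pct_digits": sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0,
        "pct_uppercase": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
        # Assumed denominators: number of distinct words.
        "hapax_ratio": freq_of_freq[1] / n_types if n_types else 0.0,
        "dis_ratio": freq_of_freq[2] / n_types if n_types else 0.0,
    }


features = lexical_features("Colorless green ideas sleep furiously.")
for name, value in features.items():
    print(f"{name}: {value}")
```

Each document yields one value per feature, so a corpus of n documents would produce an n-row matrix with one column per feature, which is the shape the package returns for the full feature set.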
We try to reproduce the Writeprints-Static feature set as faithfully as possible. Our main source for recovering the feature set is Table II (p. 12:12) of Brennan et al. (2012). Where that table leaves details uncertain, we refer to the documentation of JStylo, a privacy-enhancing tool that shares several authors with Brennan et al. (2012). Technical details and known differences are stated in each feature's docstring. For more details, see Features.
Installation
pip install writeprints-static
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
Getting started
writeprints_static.WriteprintsStatic implements the Writeprints-Static feature set (Brennan et al. 2012). The interface follows scikit-learn conventions.
The following example demonstrates how to extract Writeprints-Static features.
from writeprints_static import WriteprintsStatic
texts = ["Colorless green ideas sleep furiously.", "Furiously sleep ideas green colorless.", 'James, while John had had "had", had had "had had"; "had had" had had a better effect on the teacher.']
vec = WriteprintsStatic()
# The input must be a list of English strings, so there is no need to specify
# an input type as is usual in scikit-learn.
# The output X is a scipy.sparse.csr_matrix instance.
X = vec.transform(texts)
# Inspect the feature values as a dense array
X.toarray()
# Inspect the feature names
vec.get_feature_names()
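Because the feature matrix has several hundred columns, it can be easier to look at only the non-zero entries of one document. The helper below is a small sketch of that idea in plain Python. The function name `nonzero_features` and the toy names and values are hypothetical stand-ins, not output from the package.

```python
def nonzero_features(feature_names, row):
    """Pair each feature name with its value, keeping only non-zero entries.

    Hypothetical helper, not part of writeprints-static.
    """
    if len(feature_names) != len(row):
        raise ValueError("feature_names and row must have the same length")
    return {name: value for name, value in zip(feature_names, row) if value != 0}


# Toy stand-ins for one row of the dense matrix and its column names.
names = ["total_words", "avg_word_length", "digit_0", "letter_a"]
row = [5.0, 6.6, 0.0, 0.0]

print(nonzero_features(names, row))
```

With the objects from the example above, the same helper should accept `vec.get_feature_names()` and `X.toarray()[0]`, assuming both have one entry per feature.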
Requirements
Python 3.8+ and the following packages are required.
- spacy 3.0.8
- scipy 1.5.0+
- numpy 1.0.0+
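One way to check an environment against the requirements above is to query the installed distributions with the standard library. This is only a convenience sketch: the helper name `installed_versions` is hypothetical, and the script reports versions without comparing them to the minimums listed here.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The README asks for Python 3.8 or newer.
python_ok = sys.version_info >= (3, 8)
report = installed_versions(["spacy", "scipy", "numpy"])

print("Python 3.8+:", python_ok)
for name, ver in report.items():
    print(name, ver or "not installed")
```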
Important links
- Documentation: https://literary-materials.github.io/writeprints-static/
- Source code: https://github.com/literary-materials/writeprints-static/writeprints_static
- Issue tracker: https://github.com/literary-materials/writeprints-static/issues
License
This package is licensed under the ISC License.
Versions
- May 4, 2022: initial release, v0.0.1.
References
- Abbasi, A., & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2), 1-29.
- Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 1-22.
- Overdorf, R., & Greenstadt, R. (2016). Blogs, Twitter feeds, and Reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies, 2016(3), 155-171.