writeprints-static

Extract lexical and syntactic features for authorship attribution research.

The writeprints-static package aims to reproduce the "Writeprints-Static" feature set described in Brennan et al. (2012). The API imitates that of scikit-learn's text feature extraction classes (e.g., CountVectorizer). The Writeprints-Static feature set contains the lexical and syntactic subset of the original Writeprints feature set first proposed by Abbasi and Chen (2008).

Group	Category	No. of Features	Description
Lexical	Word Level	3	Total words, average word length, number of short words
	Character Level	3	Total characters, percentage of digits, percentage of uppercase letters
	Special Character	21	Occurrence of special characters
	Letters	26	Letter frequency
	Digits	10	Digit frequency
	Character Bigram	39	Percentage of common bigrams
	Character Trigram	20	Percentage of common trigrams
	Vocabulary Richness	2	Ratio of hapax legomena and dis legomena
Syntactic	Function Words	403	Frequency of function words
	POS Tags	22	Frequency of part-of-speech tags
	Punctuation	8	Frequency and percentage of colon, semicolon, question mark, period, exclamation mark, comma
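
To make a few of the lexical rows above concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the function name lexical_sketch, the short_word_max_len threshold of 3, the regex tokenizer, and the choice of denominators for the ratios are all assumptions made for illustration, and the package's docstrings remain the authority on how each feature is actually computed.

```python
import re
from collections import Counter


def lexical_sketch(text, short_word_max_len=3):
    """Approximate a few lexical features; not the package's implementation."""
    # Assumption: a word is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words)
    n_chars = len(text)

    def ratio(numerator, denominator):
        # Guard against empty input.
        return numerator / denominator if denominator else 0.0

    return {
        "total_words": n_words,
        "avg_word_length": ratio(sum(len(w) for w in words), n_words),
        # Assumption: "short" means at most short_word_max_len letters.
        "short_words": sum(1 for w in words if len(w) <= short_word_max_len),
        "total_chars": n_chars,
        # Assumption: percentages are taken over all characters, spaces included.
        "pct_digits": ratio(sum(c.isdigit() for c in text), n_chars),
        "pct_uppercase": ratio(sum(c.isupper() for c in text), n_chars),
        # Assumption: ratios are taken over the total word count.
        "hapax_legomena_ratio": ratio(sum(1 for n in counts.values() if n == 1), n_words),
        "dis_legomena_ratio": ratio(sum(1 for n in counts.values() if n == 2), n_words),
    }


features = lexical_sketch("Colorless green ideas sleep furiously.")
print(features["total_words"], features["short_words"])
```

Hapax legomena are words that occur exactly once in the text and dis legomena are words that occur exactly twice, which is why the sketch counts word types by frequency before forming the ratios.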

We try to reproduce the Writeprints-Static feature set as faithfully as possible. Our main source for recovering the feature set is Table II (p. 12:12) of Brennan et al. (2012). Where that table leaves details uncertain, we refer to the documentation of JStylo, a privacy-enhancing tool that shares several authors with Brennan et al. (2012). Technical details and known differences are documented in each feature's docstring. For more details, see Features.

Installation

pip install writeprints-static
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
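
If feature extraction fails right after installation, a missing spaCy model is one possible cause. The sketch below checks whether the en_core_web_sm package is importable using only the standard library. The helper name spacy_model_available is hypothetical, and the sketch assumes the model is installed as a regular Python package, which is what the wheel above provides.

```python
import importlib.util


def spacy_model_available(model_name="en_core_web_sm"):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(model_name) is not None


if not spacy_model_available():
    print("en_core_web_sm is not installed; run the second pip command above.")
```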

Getting started

writeprints_static.WriteprintsStatic implements the Writeprints-Static feature set (Brennan et al. 2012). The interface follows scikit-learn conventions.

The following demonstrates how to extract Writeprints-Static features.

from writeprints_static import WriteprintsStatic

texts = ["Colorless green ideas sleep furiously.", "Furiously sleep ideas green colorless.", 'James, while John had had "had", had had "had had"; "had had" had had a better effect on the teacher.']

vec = WriteprintsStatic()

# The input must be a list of English strings, so there is no need to specify the input type
# as one would with scikit-learn's vectorizers.
# The output X is a scipy.sparse.csr_matrix instance.
X = vec.transform(texts)

# to check the feature values
X.toarray()

# to check the feature names
vec.get_feature_names()
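
With several hundred columns, the dense array is hard to read on its own. A small helper that pairs each feature name with its value and drops the zeros can make one document's row easier to inspect. The helper nonzero_features and the demo names and values below are hypothetical and purely illustrative; the sketch assumes the order of the feature names matches the column order of X.

```python
def nonzero_features(feature_names, row):
    """Map feature names to values, keeping only non-zero entries."""
    if len(feature_names) != len(row):
        raise ValueError("feature_names and row must have the same length")
    return {name: value for name, value in zip(feature_names, row) if value != 0}


# Hypothetical names and values, for illustration only.
demo_names = ["total_words", "digit_0", "letter_a"]
demo_row = [5.0, 0.0, 0.12]
print(nonzero_features(demo_names, demo_row))
```

With the objects from the example above, the call would look like nonzero_features(vec.get_feature_names(), X.toarray()[0]) for the first text.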

Requirements

Python 3.8+ and the following packages are required.

spacy 3.0.8
scipy 1.5.0+
numpy 1.0.0+

Important links

Documentation: https://literary-materials.github.io/writeprints-static/
Source code: https://github.com/literary-materials/writeprints-static/writeprints_static
Issue tracker: https://github.com/literary-materials/writeprints-static/issues

License

This package is licensed under the ISC License.

Versions

May 4, 2022: launched v0.0.1.

References

Abbasi, A., & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2), 1-29.
Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 1-22.
Overdorf, R., & Greenstadt, R. (2016). Blogs, Twitter feeds, and Reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies, 2016(3), 155-171.