writeprints-static

Extract lexical and syntactic features for authorship attribution research.

The writeprints-static package aims to reproduce the "Writeprints-Static" feature set described in Brennan et al. (2012). The API imitates that of scikit-learn's text feature extraction classes (e.g., CountVectorizer). The Writeprints-Static feature set contains the lexical and syntactic subset of the original Writeprints feature set first proposed by Abbasi and Chen (2008).

| Group | Category | No. of Features | Description |
|---|---|---|---|
| Lexical | Word Level | 3 | Total words, average word length, number of short words |
| Lexical | Character Level | 3 | Total characters, percentage of digits, percentage of uppercase letters |
| Lexical | Special Character | 21 | Occurrence of special characters |
| Lexical | Letters | 26 | Letter frequency |
| Lexical | Digits | 10 | Digit frequency |
| Lexical | Character Bigram | 39 | Percentage of common bigrams |
| Lexical | Character Trigram | 20 | Percentage of common trigrams |
| Lexical | Vocabulary Richness | 2 | Ratio of hapax legomena and dis legomena |
| Syntactic | Function Words | 403 | Frequency of function words |
| Syntactic | POS Tags | 22 | Frequency of part-of-speech tags |
| Syntactic | Punctuation | 8 | Frequency and percentage of colon, semicolon, question mark, period, exclamation mark, comma |
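
To make a few of the lexical rows above concrete, here is a minimal sketch of how such features could be computed with the standard library alone. It is not the package's implementation: the function name lexical_sketch, the regex tokenizer, the "fewer than 4 characters" threshold for short words, and the choice to normalize hapax and dis legomena by total word count are all assumptions made for illustration. The feature docstrings remain the authoritative source for the actual definitions.

```python
import re
from collections import Counter


def lexical_sketch(text: str) -> dict:
    """Illustrative (hypothetical) versions of a few Writeprints-style lexical features."""
    # Naive tokenization: runs of letters and apostrophes (assumption; the
    # package itself may tokenize differently).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words)
    n_chars = len(text)

    def ratio(numerator: float, denominator: float) -> float:
        # Guard against empty input.
        return numerator / denominator if denominator else 0.0

    return {
        "total_words": n_words,
        "avg_word_length": ratio(sum(len(w) for w in words), n_words),
        # Assumed threshold: a "short" word has fewer than 4 characters.
        "short_words": sum(1 for w in words if len(w) < 4),
        "total_chars": n_chars,
        "pct_digits": ratio(sum(c.isdigit() for c in text), n_chars),
        "pct_uppercase": ratio(sum(c.isupper() for c in text), n_chars),
        # Words occurring exactly once / exactly twice, over total words (assumed normalization).
        "hapax_ratio": ratio(sum(1 for c in counts.values() if c == 1), n_words),
        "dis_ratio": ratio(sum(1 for c in counts.values() if c == 2), n_words),
    }
```

Calling lexical_sketch("The cat saw the cat 42 times.") returns a dictionary with one entry per feature, which is roughly the per-document view of what the vectorizer stores as one matrix row.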

We try to reproduce the Writeprints-Static feature set as faithfully as possible. Our main resource for recovering the feature set is Table II (p. 12:12) of Brennan et al. (2012). Where that table leaves details uncertain, we refer to the documentation of JStylo, a privacy-enhancing tool that shares several authors with Brennan et al. (2012). Technical details and known differences are documented in each feature's docstring. For more details, see Features.

Installation

pip install writeprints-static
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl

Getting started

writeprints_static.WriteprintsStatic implements the Writeprints-Static feature set (Brennan et al. 2012). The interface follows scikit-learn conventions.

The following example demonstrates how to extract Writeprints-Static features.

from writeprints_static import WriteprintsStatic

texts = ["Colorless green ideas sleep furiously.", "Furiously sleep ideas green colorless.", 'James, while John had had "had", had had "had had"; "had had" had had a better effect on the teacher.']

vec = WriteprintsStatic()

# The input must be a list of English strings, so there is no need to specify
# the input type, as is usual with scikit-learn vectorizers.
# The output X is a scipy.sparse.csr_matrix instance.
X = vec.transform(texts)

# to check the feature values
X.toarray()

# to check the feature names
vec.get_feature_names()
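
With more than 500 columns, reading the raw array is tedious. The sketch below pairs feature names with one document's values and keeps the largest ones. The helper top_features is hypothetical and not part of the package, and it assumes that the columns of X follow the order of vec.get_feature_names(), as they do for scikit-learn vectorizers.

```python
from typing import List, Sequence, Tuple


def top_features(
    names: Sequence[str], row: Sequence[float], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (name, value) pairs with the largest values in one row.

    Hypothetical helper for inspection only; not part of writeprints-static.
    """
    if len(names) != len(row):
        raise ValueError("names and row must have the same length")
    pairs = zip(names, (float(v) for v in row))
    # Sort by value, largest first; ties keep their original column order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Possible usage with the objects from the example above (not executed here):
#   names = vec.get_feature_names()
#   top_features(names, X.toarray()[0], k=10)
```

Note that the features mix counts and percentages, so the largest raw values are not necessarily the most informative ones; scaling the columns first may be worth considering before ranking or training a classifier.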

Requirements

Python 3.8+ and the following packages are required.

  • spacy 3.0.8
  • scipy 1.5.0+
  • numpy 1.0.0+

Links

  • Documentation: https://literary-materials.github.io/writeprints-static/
  • Source code: https://github.com/literary-materials/writeprints-static/writeprints_static
  • Issue tracker: https://github.com/literary-materials/writeprints-static/issues

License

This package is licensed under the ISC License.

Versions

  • May 4, 2022: v0.0.1 released.

References

  • Abbasi, A., & Chen, H. (2008). Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS), 26(2), 1-29.

  • Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3), 1-22.

  • Overdorf, R., & Greenstadt, R. (2016). Blogs, Twitter feeds, and Reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies, 2016(3), 155-171.