Affect data (distributed by Cecilia Ovesdotter Alm)
This affect data was collected as part of my dissertation research. Please cite my dissertation if you use any of this data. You can download an abstract here in PDF format. My dissertation has been published as a book that is available from, for instance, Amazon. (Chapter three is especially informative about the data.)
I. Annotated affect data for download
As seen in the readme.txt, this data set is released under the GNU General Public License. Please see the following notice. No warranty of quality is provided and no liability is accepted; you download the data at your own risk.
Please decompress with gunzip filename followed by tar -xvf filename.
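For those who prefer to unpack the archives from a script, the two commands above can be approximated with Python's standard tarfile module. This is only a minimal sketch: the function name extract_archive, the archive name and the destination directory are placeholders chosen for illustration and are not part of the release.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" performs the gunzip step and the tar step in one pass.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


# Hypothetical usage; substitute the name of the file you actually downloaded:
#   for name in extract_archive("Potter.tar.gz", "affect_data"):
#       print(name)
```

The returned list of member names gives a quick look at which subdirectories (for instance emmood) an archive contains.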
The affect label annotations with sentences as annotated are in the subdirectory emmood. The archives also include some other subdirectories, most of which are less interesting (but might be helpful for comparison for formatting some of the input data if using the affect tools which will be released at a later time). See the readme.txt file in the main directory for information on contents and format. Some texts by these authors can be found on Gutenberg's website (some with later revisions). Also, audio books have been released for these authors.
1. B. Potter (tar.gz file) [19 stories]
2. H.C. Andersen (tar.gz file) [77 stories]
3. Grimm's (tar.gz file) [80 stories]
Additional note: Sentence splitting was automatic (the treatment of end-of-line hyphenation in the source texts differed). Some preprocessing was applied (e.g. special whitespace like song indentation was not maintained, and some dedications were removed). In addition, some directories have files with the suffix/infix okpuncs, meaning the sentences were processed further (e.g. before part-of-speech, POS, tagging of sentences). This further processing was partly subcorpus-dependent and partly developed more on Potter. It consisted of completing double quotes at the sentence level (multisentence quote spans were not considered), removing certain edge punctuation or white space, modifying thought dashes, revising a quote-apostrophe mixup in Grimm's that came from preprocessing, and revising split hyphenated words in Potter. This was done with heuristic rules, so some inconsistencies remained or were introduced. Similarly, certain POS tags are not satisfactory. Also, some peculiarities of the original texts were kept (e.g. quotation usage in H.C. Andersen). Overall, the included files are not expected to be noise-free.
II. High agree affect data by author for viewing
I only provide these documents for convenient viewing and printing. Note that the high agree data is also included in the above compressed files in simple text format, separated by stories.
These documents list only sentences with AFFECTIVE HIGH AGREEMENT, i.e. sentences with four identical affective labels. The merged label set with six classes was used, of which five are affective: Angry-Disgusted (code: 2; merged), Fearful (code: 3), Happy (code: 4), Sad (code: 6), and Surprised (code: 7; merged). Note that since the HighAgree subcorpus considered only affective labels, sentences with four Neutral labels are NOT included. The same disclaimer and notice as for the above .tar.gz files apply to these documents.
In these files, a storyname is followed by its corresponding high agree affective sentences in the following format:
35@3@"It is very unpleasant, I am afraid of the police," said Pickles.
Explanation: The sentence "It is very unpleasant, I am afraid of the police," said Pickles. from the story Ginger and Pickles had four fearful affect labels. Its sentence ID was #35 (i.e. it was the 36th sentence in the story, since numbering starts at 0). In the file, it is followed by another high agree affect sentence from the same story, before the next story's entries begin.
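The line format explained above can be read with a few lines of Python. The following is a rough sketch, not a tool shipped with the data: the names AFFECT_LABELS, ENTRY and parse_high_agree are made up for illustration, and the sketch assumes that any non-empty line not matching the id@code@sentence pattern is a story name, which should be checked against the actual files.

```python
import re

# Label codes as listed for the merged affective classes above.
AFFECT_LABELS = {
    2: "Angry-Disgusted",
    3: "Fearful",
    4: "Happy",
    6: "Sad",
    7: "Surprised",
}

# sentence-id @ label code @ sentence text (the text itself may contain "@").
ENTRY = re.compile(r"^(\d+)@(\d+)@(.*)$")


def parse_high_agree(lines):
    """Return {story_name: [(sentence_id, label, sentence), ...]}."""
    stories = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            # Assumption: a non-matching line is the name of the next story.
            current = line
            stories.setdefault(current, [])
            continue
        if current is None:
            # Entry seen before any story name; keep it under a placeholder.
            current = "(unknown story)"
            stories.setdefault(current, [])
        sentence_id = int(match.group(1))
        code = int(match.group(2))
        # Unknown codes are kept as their numeric string rather than dropped.
        label = AFFECT_LABELS.get(code, str(code))
        stories[current].append((sentence_id, label, match.group(3)))
    return stories
```

Passing an open text file (or any list of lines) to parse_high_agree gives one list of entries per story; for the example line above, the entry would carry sentence ID 35 and the label Fearful.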
1. B. Potter, high agree sentences (txt file) (doc file)
2. H.C. Andersen, high agree sentences (txt file) (doc file)
3. Grimm's, high agree sentences (txt file) (doc file)
III. Other annotations
My classification experiments did not use these other annotations, and the below files are not part of the dataset, but the same disclaimer and notice as linked above apply to these documents. They also contain the other annotations (incl. the optional ones): feeler, intensity (could also indicate secondary or less dominant emotions), and lists of connotative/emotional words/phrases marked as assisting emotion annotation. They are made available here for additional inspection just as they are (e.g. some missing, some optional, some design flaws etc.). There are subdirectories by annotator. The files' header is self-explanatory, but ignore the POS column (instead see POS in the above .tar.gz files). For annotating feeler, annotators generally had rough character lists for stories, with canonical characters (e.g. 'villain', 'hero', 'step mother', 'mc' or main character, etc.) and story-specific character names (plus they could use 'reader' and 'other'). For at least two longer stories individual character sets were improvised. The connotative/emotional words/phrases were marked up by word number, and duplicate entries or words entered out of order occur (perhaps slight word split differences could have occurred when printing words, although this was not necessarily seen, and entries from the 100th word were not included).
1. B. Potter other (tar.gz file)
2. H.C. Andersen other (tar.gz file)
3. Grimm's other (tar.gz file)
(In case you are interested in the python-cpickled data annotation files, from which the above text files for specific annotations were extracted, these are here: B. Potter preextract (tar.gz file), H.C. Andersen preextract (tar.gz file), Grimm's preextract (tar.gz file). The variable names of sentence objects in lists naturally differ from the above annotation names. These are also not part of the data set. The same notice as above applies here.)
My dissertation and the readme should answer most questions. Please consult those first. If you need information that is not provided in the dissertation or in the readme, you may try contacting me by e-mail: cissioalm@ifyouarehumanreplacethiswithgmaildotcom