COMM2: A Test Set of AAC-like Communications
-----------------------------------------
https://aactext.org/comm2/

This is a test set generated using a crowdsourced task in which workers invented
statements and questions about 10 different communication situations.

Further details about the collection methodology and analysis can be found in this ASSETS '13 poster:
  @inproceedings{vertanen_comm2,
    author       = {Keith Vertanen},
    title        = {A Collection of Conversational AAC-like Communications},
    booktitle    = {ASSETS '13: Proceedings of the ACM SIGACCESS Conference on Computers and Accessibility},
    year         = {2013}, 
  }

Our procedure follows the one described in this paper:
  @article{venkatagiri_efficient, 
    author       = "Horabail Venkatagiri", 
    title        = "Efficient keyboard layouts for sequential access in augmentative and alternative communication",     
    journal      = "Augmentative and Alternative Communication",
    volume       = {15},
    number       = {2},    
    year         = {1999}, 
    pages        = {126--134}, 
  }

Venkatagiri's original test set is available here (provided with permission):
  http://www.aactext.org/imagine/

The main difference in our COMM2 collection is the quantity of communications: 
1,506 communications in our set versus 260 in Venkatagiri's original collection.

The collected communications are provided after a manual review in which we corrected
obvious spelling and grammar mistakes.  The original mixed case communications, which
include numbers and punctuation, are provided along with various filtered versions
that may be useful depending on the limitations of the interface being tested.

The COMM2 collection is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License, http://creativecommons.org/licenses/by-nd/3.0/

Details of included files
-------------------------
readme.txt                   - This file

comm2.txt                    - Test set, mixed case, first column is a unique ID, second is the text
comm2_nonum.txt              - Test set, mixed case, excluding numbers 0-9
comm2_nonum_nopunc.txt       - Test set, mixed case, excluding numbers 0-9, stripping punctuation (other than apostrophe)            
comm2_nonum_ending.txt       - Test set, mixed case, excluding numbers 0-9, stripping punctuation except final ending .?!

comm2_lower.txt              - Test set, lower case
comm2_lower_nonum.txt        - Test set, lower case, excluding numbers 0-9
comm2_lower_nonum_nopunc.txt - Test set, lower case, excluding numbers 0-9, stripping punctuation (other than apostrophe)            
comm2_lower_nonum_ending.txt - Test set, lower case, excluding numbers 0-9, stripping punctuation except final ending .?!

comm2_vocab.txt              - Alphabetical list of the 1,682 unique words occurring in the communications (ignoring case, stripping punctuation, and dropping communications containing 0-9)
lm_comm2_1gram.arpa          - Unigram maximum likelihood language model trained on lower case data, stripping punctuation and dropping communications with 0-9
train_nonum_nopunc.txt       - Just the text data without unique prompt IDs, used to train the language models

comm.html                    - HTML + JavaScript page that we used for our data collection task
comm_demo.html               - Demo version that can run locally outside of Amazon Mechanical Turk. NOTE: does not save any of the input.
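
The filtered variants listed above can be approximated from comm2.txt with a few
lines of Python. This is only a rough sketch, not the script used to build the
released files. It assumes the ID and the text are separated by whitespace, that
"excluding numbers" means dropping any communication containing a digit, and that
only ASCII letters, digits, and apostrophes survive punctuation stripping. The
function names are made up for illustration.

```python
import re

def parse_line(line):
    """Split a line into (id, text); assumes the ID is the first whitespace-delimited field."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        return None  # skip blank or malformed lines
    return parts[0], parts[1]

def strip_punct(text, keep_ending=False):
    """Remove punctuation other than apostrophes; optionally keep one final . ? or !"""
    ending = ""
    if keep_ending and text and text[-1] in ".?!":
        ending = text[-1]
        text = text[:-1]
    # Assumption: only ASCII letters, digits, apostrophes, and whitespace are kept.
    cleaned = re.sub(r"[^A-Za-z0-9'\s]", "", text)
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    return cleaned + ending

def filter_set(lines, lower=False, nonum=False, nopunc=False, ending=False):
    """Return (id, text) pairs after applying the requested filters."""
    out = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ident, text = parsed
        if nonum and re.search(r"[0-9]", text):
            continue  # assumption: drop the whole communication if it has a digit
        if nopunc or ending:
            text = strip_punct(text, keep_ending=ending)
        if lower:
            text = text.lower()
        out.append((ident, text))
    return out
```

For example, filter_set(open("comm2.txt"), lower=True, nonum=True, nopunc=True)
should give something close to comm2_lower_nonum_nopunc.txt, though the output
may not match the released file line for line.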

The language model was trained with the following SRILM command:
  % ngram-count -order 1 -lm lm_comm2_1gram.arpa -gt1max 0 -gt1min 0 -text train_nonum_nopunc.txt 
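
With -gt1max 0, SRILM disables Good-Turing discounting for unigrams, so each word's
probability is simply its relative frequency in the training text. The sketch
below shows that calculation in Python, under the assumption that the input is
one communication per line as in train_nonum_nopunc.txt. It ignores the sentence
boundary tokens that SRILM adds, so the values will not exactly match those in
lm_comm2_1gram.arpa. The function name unigram_mle is made up for illustration.

```python
import math
from collections import Counter

def unigram_mle(sentences):
    """Return {word: log10 probability} using relative frequency (maximum likelihood)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())  # whitespace tokenization
    total = sum(counts.values())
    # ARPA files store base-10 log probabilities, so we do the same here.
    return {word: math.log10(count / total) for word, count in counts.items()}
```

For example, unigram_mle(open("train_nonum_nopunc.txt")) returns a dictionary
that can be compared against the unigram section of the ARPA file.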

Have fun!
Keith Vertanen

Revision history
----------------
Jun 20, 2013    First release of COMM2 test set.
Jun 10, 2019    Updated to include a standalone HTML file that works outside Amazon Mechanical Turk for demo purposes.