A Crowdsourced Corpus of AAC-like Communications

We used Amazon Mechanical Turk to create a large set of fictional AAC-like communications (AAC: augmentative and alternative communication). Workers were asked to invent communications as if they were using a scanning-style AAC interface. Our AAC corpus contains approximately six thousand communications. We found that our crowdsourced collection modeled conversational AAC better than datasets based on telephone conversations or newswire text. We also used our crowdsourced messages to select similar sentences from much larger sets of Twitter, blog, and Usenet data. For details, see our paper.

Below you can download our corpus of communications, some of the test sets we used, and some of our trained language models. Language models are in ARPA text format. If you use this resource in your research, please reference:

  • Keith Vertanen and Per Ola Kristensson. The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL: 700-711, 2011. PDF  BibTeX

We thank Keith Trnka for allowing us to provide the Switchboard test set. We thank Horabail Venkatagiri for allowing us to provide the communication test set. Our specialists test set was created from the phrases suggested by AAC professionals on these pages at the University of Nebraska-Lincoln: page1  page2  page3  page4.

With the exception of lm_test_switch.txt and lm_test_comm.txt, the resources listed below are licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

Corpus:

Language models:
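
The ARPA text format used by the language models is plain text. A `\data\` header lists how many n-grams of each order the model contains, followed by one `\N-grams:` section per order and a closing `\end\` line. As a quick sanity check after downloading, the header can be read with a few lines of Python. This is only a minimal sketch: the function name `read_arpa_counts` and the tiny `SAMPLE` model are hypothetical illustrations and are not taken from the released files.

```python
import re
from typing import Dict, Iterable

# Header lines in an ARPA file look like "ngram 1=3".
_COUNT_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_arpa_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {n-gram order: count} parsed from the \\data\\ header."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue  # skip anything that precedes the header
        if line.startswith("\\"):
            break  # the first "\N-grams:" section ends the header
        match = _COUNT_LINE.match(line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


# A made-up two-order model, for illustration only (not real corpus data).
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.4771 </s>
-0.4771 <s> -0.3010
-0.4771 hello -0.3010

\\2-grams:
-0.3010 <s> hello
-0.3010 hello </s>

\\end\\
"""

if __name__ == "__main__":
    print(read_arpa_counts(SAMPLE.splitlines()))
```

To check a downloaded model, pass an open text file handle instead of `SAMPLE.splitlines()`. Because the loop stops at the first n-gram section, only the header is read, even for a large model.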

If you have questions or comments, contact

Page last updated: August 15, 2011