Looking for genre-specific corpora

Dear All,

I am doing some research on concept extraction from different types of texts, or genres.
I am looking for free research corpora belonging to the following genres:

1) FAQs (I have already downloaded some small collections, but I would like to have a more comprehensive range of topics).
2) Chat log transcripts (I have already downloaded the NPS Collection, 3 CODIAC datasets and several smallish Many Eyes datasets)
3) Telephone conversation transcripts (missing)
4) Emails (I have already downloaded the Enron dataset and a couple of junk mail collections)
5) Tweets (missing, apparently the Edinburgh Twitter corpus is no longer available)
6) Corporate weblogs (missing)

I will be glad to share all the links and related documentation once I have covered all the genres in the list.

Thanks in advance for your suggestions.
Marina

6 comments for “Looking for genre-specific corpora”

  1. marinasantini.ms@gmail.com
    22 May, 2011 at 12:43

    Summary of replies

    Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet
    Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns,
    Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.

    Suggested Corpora and Resources in English if not stated otherwise
    (not all of them are free of charge)

    Genre-specific corpora:
    – Genre: SMS Messages = NUS SMS corpus:
    http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)

    – Genre: chatlogs = CODIAC chatlogs
    (http://data.eol.ucar.edu/codiac/dss/id=92.124;
    http://data.eol.ucar.edu/codiac/dss/id=88.044;
    http://data.eol.ucar.edu/codiac/dss/id=107.010)

    – Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
    here:
    http://www-958.ibm.com/software/data/cognos/manyeyes/datasets

    – Genre: chats and telephone conversations = Samples of the
    Switchboard corpus and the NPS Chat corpus are included in NLTK data
    (http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS
    Chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a
    POS-tagged chat corpus, and the Switchboard corpus
    (http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a telephone
    conversation corpus.

    – The Linguistic Data Consortium has a good deal of telephone
    conversation data: many files in a variety of languages. See
    http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text
    (not for free)

    – Genre: blogs = The Corporate weblogs dataset in TREC datasets
    (http://ir.dcs.gla.ac.uk/test_collections/) is not for free. Helpful wiki:
    http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
    – Genre: corporate blogs = It is possible to pull corporate blog feeds
    or scrape the blogs from this list:
    http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/

    – The Göteborg Spoken Language Corpus and other corpora in
    Swedish (http://spraakbanken.gu.se/)

    – Genre: tweets = The twitter corpus associated with the paper
    http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
    here: https://sites.google.com/site/twittersentimenthelp/for-researchers

    – Genre: tweets and other microblogs = MicroBlog track
    http://sites.google.com/site/trecmicroblogtrack/ (not for free)

    – Genre: Newswires: Reuters’ Newswires collections =
    http://trec.nist.gov/data/reuters/reuters.html

    – Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/);
    categorized Enron emails (http://sgi.nu/enron/corpora.php)

    – Genre: emails = Junk email corpus
    (http://clg.wlv.ac.uk/resources/junk-emails/index.php)

    – Genre: FAQs = 200 FAQs
    (http://www.itri.brighton.ac.uk/~Marina.Santini/#Download)
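
For anyone who wants to try the suggestion above about pulling corporate blog feeds, here is a minimal sketch of reading posts out of an RSS 2.0 feed with Python's standard library. The sample feed, its URLs and the function name parse_rss_items are made up for illustration; real blogs may publish Atom or other formats that need different handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for a real corporate blog's RSS 2.0 feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Corporate Blog</title>
    <item>
      <title>First post</title>
      <link>http://blog.example.com/first</link>
      <description>Welcome to our blog.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://blog.example.com/second</link>
      <description>Product news.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, link, text) per <item> element in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return posts


posts = parse_rss_items(SAMPLE_FEED)
for post in posts:
    print(post["title"], "-", post["link"])
```

With a live feed, the XML could be fetched first (for example with urllib.request.urlopen) and passed to the same function. The description field often contains HTML that would need stripping before any concept extraction.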

    Resources:
    – In terms of words and concepts, there are two main resources for
    English. The first is WordNet, originally from Princeton; it is included
    in NLTK (and one can also get it separately). It contains English words
    ‘organized’ according to their relationships: synonym, hyponym, part
    of a whole, etc. The other resource is Word Association Norms, which
    one can get from the University of South Florida
    (http://w3.usf.edu/FreeAssociation/).
    – Article: Hella Koo Finding: Twitter Dialect –
    http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
    – Genre: tweets = the suggestion is to use the Twitter API to crawl
    a Twitter dataset.
    – DiscoverText is a program you can use to scoop out Twitter feeds
    really easily. Their website is here:
    http://discovertext.com/defaultDT2.aspx
    One can do a free 30-day trial and get a bunch of Twitter messages.

    Note:
    Genre: Tweets = The Edinburgh Tweets corpus has been withdrawn:
    http://demeter.inf.ed.ac.uk/

    This post is also available here: http://linguistlist.org/issues/22/22-2068.html

  2. marinasantini.ms@gmail.com
    23 May, 2011 at 06:44

    Update:
    The Corporate Blogging Corpus (CBC/Corporati) by Cornelius Puschmann. Thesis and corpus downloadable from http://ynada.com/cbc-corporati/

  3. marinasantini.ms@gmail.com
    7 September, 2011 at 11:40

    Update:
    The new Twitter corpus, HERMES, is now available. It’s about 100 million words. There are also JSON files with metadata. It was created by Michele Zappavigna, University of Sydney.

    Here is her webpage
    http://sydney.edu.au/arts/linguistics/staff/academic_staff/michele_zappavigna.shtml

    Contact her for more information about how to get hold of the corpus.

  4. 5 March, 2012 at 09:37

    Suggestion by Giacomo Inches

    Hi,

    I was working with this collection of different documents (twitter, chat, forum, ratings, comments), that you may find interesting:
    http://caw2.barcelonamedia.org/node/7

    Cheers
    Giacomo

    P.S.
    If you are interested, there are some analyses of the collection here:
    [1] Giacomo Inches, Mark James Carman, Fabio Crestani: Investigating the Statistical Properties of User-Generated Documents. FQAS 2011: 198-209, http://www.ir.inf.usi.ch/sites/default/files/GiacomoInches-fqas11.pdf
    [2] Giacomo Inches, Mark James Carman, Fabio Crestani: Statistics of Online User-Generated Short Documents. ECIR 2010: 649-652, http://www.ir.inf.usi.ch/sites/default/files/InchesGiacomo-Ecir2010-paper.pdf

    Giacomo Inches
    PhD Student
    Faculty of Informatics
    Università della Svizzera italiana, USI
    Via G. Buffi 13
    CH – 6904 Lugano
    (e) giacomo.inches@usi.ch
    (w) http://www.giacomo.inches.ch

  5. 5 March, 2012 at 09:39

    Suggestion by John K Pate http://homepages.inf.ed.ac.uk/s0930006/

    Micha Elsner has made his IRC dataset available under the software
    section of:

    http://www.cs.brown.edu/~melsner/

    The corresponding papers are:
    http://aclweb.org/anthology-new/J/J10/J10-3004.pdf
    http://www.cs.brown.edu/~melsner/chat.pdf

    Hope this helps,

    ==

  6. 5 March, 2012 at 09:49

    From: Betsy Barry [bbarry@illocutioninc.com]
    Subject: Free Twitter Lexicon Download

    Illocution Inc is offering a free download of the current Twitter English
    Lexicon. It includes one-gram and two-gram reports and is available in
    both data and text files. Have a look, it’s a very interesting data set. And
    it’s free!

    Download here:

    http://www.illocutioninc.com/Research/
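
To give an idea of what one-gram and two-gram reports like the ones mentioned above contain, here is a minimal sketch that counts unigrams and bigrams over a handful of tweets. It is not Illocution's method; the sample tweets, the simple tokenizer and the function names are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep word tokens, including #hashtags and @mentions."""
    return re.findall(r"[#@]?\w+", text.lower())


def ngram_counts(texts, n):
    """Count n-grams (as tuples of tokens) without crossing text boundaries."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical sample data, not taken from any real corpus.
sample_tweets = [
    "so tired today",
    "so tired of mondays",
]

unigrams = ngram_counts(sample_tweets, 1)
bigrams = ngram_counts(sample_tweets, 2)

for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Counting per tweet keeps the last word of one message from pairing with the first word of the next, which matters for very short texts.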
