Dear All,
I am doing some research on concept extraction from different text types or genres.
I am looking for free research corpora belonging to the following genres:
1) FAQs (I have already downloaded some small collections, but I would like to have a more comprehensive range of topics).
2) Chat log transcripts (I have already downloaded the NPS collection, three CODIAC datasets, and several smallish Many Eyes datasets)
3) Telephone conversation transcripts (missing)
4) Emails (I have already downloaded the Enron dataset and a couple of junk mail collections)
5) Tweets (missing; apparently the Edinburgh Twitter corpus is no longer available)
6) Corporate weblogs (missing)
I will be glad to share all the links and related documentation once I have collected all the genres in the list.
Thanks in advance for your suggestions.
Marina
Summary of replies
Many thanks to Laura Christopherson, Cohan Sujay Carlos, Vineet
Yadav, Jason Teeple, Leslie Barrett, Joakim Nordström, Bob Kuhns,
Dong Wang, Dave Lewis, John Tait, and Loredana Cerrato.
Suggested Corpora and Resources (in English unless stated otherwise;
not all of them are free of charge)
Genre-specific corpora:
– Genre: SMS Messages = NUS SMS corpus:
http://wing.comp.nus.edu.sg:8080/SMSCorpus/ (English / Chinese)
– Genre: chatlogs = CODIAC chatlogs
(http://data.eol.ucar.edu/codiac/dss/id=92.124;
http://data.eol.ucar.edu/codiac/dss/id=88.044;
http://data.eol.ucar.edu/codiac/dss/id=107.010)
– Genre: chatlogs = Many Eyes datasets: some chatlogs can be found
here:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets
– Genre: chats and Switchboard conversations =
Samples of the Switchboard corpus and the NPS chat corpus are included in NLTK data
(http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml). The NPS
chat corpus (http://faculty.nps.edu/cmartell/NPSChat.htm) is a POS-tagged
chat corpus, and the Switchboard corpus
(http://spot.colorado.edu/~michaeli/Lexsubj/swbd.html) is a telephone
conversation corpus.
– The Linguistic Data Consortium has a good deal of telephone
conversation data: many files in a variety of languages. See
http://www.ldc.upenn.edu/Catalog/byType.jsp#lexicon,%20speech,%20text
(not free)
– Genre: blogs = The corporate weblogs dataset in the TREC datasets
(http://ir.dcs.gla.ac.uk/test_collections/) is not free. Helpful wiki:
http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
– Genre: corporate blogs = It is possible to pull corporate blog feeds
or scrape the blogs from this list:
http://www.debbieweil.com/blog/list-of-67-big-brand-corporate-blogs/
– The Göteborg Spoken Language Corpus and other corpora in
Swedish (http://spraakbanken.gu.se/)
– Genre: tweets = The Twitter corpus associated with the paper
http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf is
here: https://sites.google.com/site/twittersentimenthelp/for-researchers
– Genre: tweets and other microblogs = TREC Microblog track
http://sites.google.com/site/trecmicroblogtrack/ (not free)
– Genre: newswires = Reuters newswire collections
(http://trec.nist.gov/data/reuters/reuters.html)
– Genre: emails = Enron corpus (http://www.cs.cmu.edu/~enron/);
categorized Enron emails (http://sgi.nu/enron/corpora.php)
– Genre: emails = Junk email corpus
(http://clg.wlv.ac.uk/resources/junk-emails/index.php)
– Genre: FAQs = 200 FAQs
(http://www.itri.brighton.ac.uk/~Marina.Santini/#Download)
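Regarding the corporate blog suggestion above: pulling a blog feed mostly comes down to parsing its RSS XML. Below is a minimal sketch in Python, assuming the blog publishes a standard RSS 2.0 feed. The sample feed, its example.com URLs, and the function name `extract_posts` are made up for illustration and are not taken from any blog on that list.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a downloaded corporate blog feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Corporate Blog</title>
    <item>
      <title>Product update</title>
      <link>http://blog.example.com/product-update</link>
      <description>We shipped a new release this week.</description>
    </item>
    <item>
      <title>Meet the team</title>
      <link>http://blog.example.com/meet-the-team</link>
      <description>An interview with our support staff.</description>
    </item>
  </channel>
</rss>
"""


def extract_posts(feed_xml):
    """Return one dict (title, link, text) per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            # findtext returns None when the element is missing, hence the "or".
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_FEED):
        print(post["title"], "|", post["link"])
```

For a real feed, the XML could be fetched first (for example with `urllib.request.urlopen(url).read()`) and passed to the same function. Feeds in Atom format use different element names and would need a different parser. Many feeds carry only an excerpt in the description, so the linked pages may still need to be scraped.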
Resources:
– In terms of words and concepts, there are two main resources for
English. The first is WordNet, originally from Princeton; it is included in
NLTK (and can also be obtained separately). It organizes English words
according to their relationships: synonym, hyponym, part of a whole, etc.
The other resource is the Word Association Norms, available from the
University of South Florida (http://w3.usf.edu/FreeAssociation/).
– Article: Hella Koo Finding: Twitter Dialect –
http://blogs.wsj.com/ideas-market/2011/02/08/hella-koo-finding-twitter-dialect/
– Genre: tweets = the suggestion is to use the Twitter API to crawl
your own Twitter dataset.
– DiscoverText is a program that makes it easy to collect Twitter
feeds. Their website is here:
http://discovertext.com/defaultDT2.aspx
One can sign up for a free 30-day trial and collect a batch of Twitter messages.
Note:
Genre: Tweets = The Edinburgh Twitter corpus has been withdrawn:
http://demeter.inf.ed.ac.uk/
This post is also available here: http://linguistlist.org/issues/22/22-2068.html
Update:
The Corporate Blogging Corpus (CBC/Corporati) by Cornelius Puschmann. The thesis and corpus can be downloaded from http://ynada.com/cbc-corporati/
Update:
The new Twitter corpus, HERMES, is now available. It contains about 100 million words. There are also JSON files with metadata. It was created by Michele Zappavigna, University of Sydney.
Here is her webpage:
http://sydney.edu.au/arts/linguistics/staff/academic_staff/michele_zappavigna.shtml
Contact her for more information about how to obtain the corpus.
Suggestion by Giacomo Inches
Hi,
I have been working with this collection of different document types (Twitter, chat, forum, ratings, comments) that you may find interesting:
http://caw2.barcelonamedia.org/node/7
Cheers
Giacomo
P.S.
If you are interested, there are some analyses of the collection here:
[1] Giacomo Inches, Mark James Carman, Fabio Crestani: Investigating the Statistical Properties of User-Generated Documents. FQAS 2011: 198-209, http://www.ir.inf.usi.ch/sites/default/files/GiacomoInches-fqas11.pdf
[2] Giacomo Inches, Mark James Carman, Fabio Crestani: Statistics of Online User-Generated Short Documents. ECIR 2010: 649-652, http://www.ir.inf.usi.ch/sites/default/files/InchesGiacomo-Ecir2010-paper.pdf
Giacomo Inches
PhD Student
Faculty of Informatics
Università della Svizzera italiana, USI
Via G. Buffi 13
CH – 6904 Lugano
(e) giacomo.inches@usi.ch
(w) http://www.giacomo.inches.ch
Suggestion by John K Pate http://homepages.inf.ed.ac.uk/s0930006/
Micha Elsner has made his IRC dataset available under the software
section of:
http://www.cs.brown.edu/~melsner/
The corresponding papers are:
http://aclweb.org/anthology-new/J/J10/J10-3004.pdf
http://www.cs.brown.edu/~melsner/chat.pdf
Hope this helps,
==
From: Betsy Barry [bbarry@illocutioninc.com]
Subject: Free Twitter Lexicon Download
Illocution Inc is offering a free download of the current Twitter English
Lexicon. It includes one-gram and two-gram reports and is available in
both data and text files. Have a look, it’s a very interesting data set. And
it’s free!
Download here:
http://www.illocutioninc.com/Research/
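For readers unfamiliar with the terms: a one-gram report counts single tokens, and a two-gram report counts pairs of adjacent tokens. Below is a rough sketch in Python of how comparable counts could be built from one's own tweet collection. It is not Illocution's method or file format. The whitespace tokenizer, the sample tweets, and the function name `ngram_counts` are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count n-grams (tuples of lowercased tokens) across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        # Slide a window of length n over the tokens; n-grams never span two texts.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample tweets, for illustration only.
tweets = [
    "hella good coffee today",
    "good coffee is good",
]

unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

Real tweets would need a tokenizer that handles hashtags, @-mentions, URLs and emoticons; the whitespace split is only the simplest possible stand-in.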