Dissemination: Web Corpora Available

1) Common Crawl web corpus: WebDataCommons is offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites. Two Common Crawl web corpora are available: one consisting of 2.5 billion HTML pages dating from 2009/2010, and a second consisting of 1.4 billion HTML pages dating from February 2012. The 2009/2010 extraction resulted in 5.1 billion RDF quads, which describe 1.5 billion entities and originate from 19.1 million websites. The February 2012 extraction resulted in 3.2 billion RDF quads, which describe 1.2 billion entities and originate from 65.4 million websites. More detailed statistics about the distribution of formats, entities and websites serving structured data, as well as growth between 2009/2010 and 2012, are provided on the project website: http://webdatacommons.org/

2) SdeWaC: SdeWaC is a corpus created from a subset of the deWaC corpus. It contains about 44 million sentences and 884 million tokens. The sentences were selected because they could be parsed syntactically by a standard dependency parser for German. A separate document (file “web-address-list.txt”) lists the URLs of the source texts. See http://wacky.sslmit.unibo.it/ for more details.

1 comment for “Dissemination: Web Corpora Available”

  1. 31 March, 2012 at 09:18

    From Applied Linguistics LinkedIn Group
    See discussion here: http://lnkd.in/BukR7q

    mitiku teshome • learners engagement changed teaching to learning. learning is something expected from ESL teachers. is there anything?

    Marina Santini • Hi Mitiku,
    try and have a look. Or investigate existing learner corpora here: http://www.athel.com/learner_corpora.html
