Dissemination: Web Corpora Available
1) Common Crawl web corpus: WebDataCommons offers 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites. Two Common Crawl web corpora are available: one corpus of 2.5 billion HTML pages dating from 2009/2010, and a second corpus of 1.4 billion HTML pages dating from February 2012. The 2009/2010 extraction resulted in 5.1 billion RDF quads, which describe 1.5 billion entities and originate from 19.1 million websites. The February 2012 extraction resulted in 3.2 billion RDF quads, which describe 1.2 billion entities and originate from 65.4 million websites. More detailed statistics about the distribution of formats, entities and websites serving structured data, as well as growth between 2009/2010 and 2012, are provided on the project website: http://webdatacommons.org/
2) SdeWaC: SdeWaC is a corpus created from a subset of the deWaC corpus. It contains about 44 million sentences and 884 million tokens. The sentences were selected because they could be parsed syntactically by a standard dependency parser for German. A separate file, “web-address-list.txt”, lists the URLs of the source texts. See http://wacky.sslmit.unibo.it/ for more details.