Tag: web data

Dissemination: Web Corpora Available

1) Common Crawl web corpus: WebDataCommons is offering 3.2 billion quads of current RDFa, Microdata and Microformat data extracted from 65.4 million websites. Two Common Crawl web corpora are available: one corpus consisting of 2.5 billion HTML pages dating from…