We are building a small corpus of medical documents from the web. The sublanguage used in these web documents has been annotated as “lay” or “specialized” by two annotators (annotation is still ongoing). We would like to use this corpus for bootstrapping (semi-supervised and weakly-supervised learning), for lay-specialized terminology extraction, for the automatic identification of related terms, and for similar tasks. If you have suggestions or pointers to interesting new directions in this field of research, we would be glad to hear from you.
The abstract and a link to the paper follow.
A Web Corpus for eCare: Collection, Annotation and Learning - Preliminary Results – DRAFT: 20 March 2017
by Marina Santini, Marjan Alirezai, Mikael Nyström, and Arne Jönsson
We present eCare Sv, Beta, a small corpus of web documents written in Swedish. The content of the documents refers to pre-selected medical concepts. The sublanguage used in each document has been labelled as lay or specialized by two annotators. The corpus is structured as a graph and designed as a flexible and dynamic text resource, where additional concept-related documents can be appended both in Swedish and in other languages over time. We also present exploratory experiments based on supervised machine learning. Results indicate that the lay-specialized labels in the corpus can be reliably learned by standard classifiers regardless of noise and scalability issues.
Keywords: web corpus, medical corpus, lay-specialized annotation, biomedical text mining, metadata and interoperability, supervised machine learning
Read full draft paper here (copy and paste the link):