Request: Corpus-Based Sublanguage Glossary

How can a glossary of specialized term = common word pairs be built automatically?

Dear all,

I wonder if you have any experience, or can provide references, on how to automatically build a glossary from genre-specific corpora. The glossary should consist of pairs of the form sublanguage term = common/familiar word. For instance:

anemia = blood deficiency
analgesic = painkiller

Thanks in advance for suggestions and pointers.



3 comments for “Request: Corpus-Based Sublanguage Glossary”

  1. 2 January, 2013 at 00:08

    Hi, Marina –

Have you thought about using Wikipedia’s infoboxes or Google’s structured results for this task? I would generate a list of definienda, extract the corresponding fields from those sources, and add a human QA loop. Hope this helps.
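A minimal sketch of that pipeline, assuming Wikipedia’s public REST summary endpoint and the Python requests library; the term list, the first-sentence heuristic, and all names here are placeholders, and the output is only a set of candidate glosses for the human QA loop:

```python
import requests

# Hypothetical seed list of definienda (the specialized terms to gloss).
TERMS = ["anemia", "analgesic", "myocardial infarction"]

SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{}"

def first_sentence(text):
    """Crude heuristic: keep everything up to the first sentence break."""
    return text.split(". ")[0]

candidate_glossary = {}
for term in TERMS:
    # The REST API expects underscores in page titles.
    resp = requests.get(SUMMARY_API.format(term.replace(" ", "_")), timeout=10)
    if resp.status_code != 200:
        continue  # no article found; leave the term for manual handling
    extract = resp.json().get("extract", "")
    if extract:
        # Store the lead sentence as a candidate gloss for human review.
        candidate_glossary[term] = first_sentence(extract)

for term, gloss in candidate_glossary.items():
    print(f"{term} = {gloss}")
```

The lead sentence stands in for the infobox fields here because it is the easiest field to reach through the public API; a proper infobox parse would need the page’s wikitext.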

    Kind regards –


  2. 9 January, 2013 at 12:09

    Hi Patrick,
Thanks for your suggestion!

  3. 9 January, 2013 at 12:12

    From: Text Analytics LinkedIn group

Lance Norskog • How’s this:

    Assume that if two documents are related, they will have many words in common. Then the remaining disjoint words are also related. So,
* Rank all document-vector pairs by the number of common words, and score the remaining word pairs by this rank.

    * Add up all word pairs and sort them by collective rank.

    * Score individual words using the ranks of their word pairs.

Now you have a large list of synonym pairs. The most common words are the glossary words, and their highest-ranked pairs are the words that “explain” them.
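A minimal, self-contained sketch of this pairwise ranking (an editorial illustration, not Lance’s code), assuming documents have already been tokenized into sets of content words; the toy corpus is a placeholder:

```python
from collections import Counter
from itertools import combinations, product

# Toy corpus: each document reduced to a set of content words
# (tokenization and stopword removal assumed to happen upstream).
docs = [
    {"patient", "anemia", "iron", "fatigue", "diagnosis"},
    {"patient", "blood", "deficiency", "iron", "fatigue", "diagnosis"},
    {"dose", "analgesic", "relief", "patient"},
    {"dose", "painkiller", "relief", "patient"},
]

# Rank document pairs by overlap; credit each disjoint word pair
# with that overlap score, summing over all document pairs.
pair_scores = Counter()
for a, b in combinations(docs, 2):
    overlap = len(a & b)
    if overlap == 0:
        continue
    for w1, w2 in product(a - b, b - a):
        pair_scores[tuple(sorted((w1, w2)))] += overlap

# Score individual words by the ranks of their word pairs.
word_scores = Counter()
for (w1, w2), score in pair_scores.items():
    word_scores[w1] += score
    word_scores[w2] += score

# The highest-scoring words are glossary headwords; their top-ranked
# partners are the candidate "explanations".
for head, _ in word_scores.most_common(3):
    partners = [(score, w2 if w1 == head else w1)
                for (w1, w2), score in pair_scores.items()
                if head in (w1, w2)]
    partners.sort(reverse=True)
    print(head, "->", [w for _, w in partners[:3]])
```

On a real corpus the quadratic document-pair loop is the bottleneck, which is part of what motivates the SVD refinement below.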

If you decide this approach has value, singular-value decomposition (SVD) gives a more precise method of assigning these ranks.

    * Make a term-document index for the corpus.

* Do an SVD (Mahout’s SSVD job can handle this for large corpora).

    * Zero out low-value singular values in SVD output. This reduces noise in the original document matrix.

    * Cluster term vectors (one entry for each document) by cosine distance, using the altered singular values matrix.

Here’s the idea: if two documents share many terms, then the documents should be close. If the documents are close, then all their terms are slightly related. But disjoint terms are orthogonal (90 degrees apart) in the document vectors. This 90-degree distance is noise! The distance should be smaller. That is, all terms in both documents are related, and the reduced singular matrix makes this clear. They are not direct synonyms, but as a whole they have a tenuous bond. If you process all vector pairs and sum the distances between all term pairs, the word pairs that behave as synonyms should end up with the highest scores.
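A small sketch of those four steps, with NumPy standing in for Mahout’s SSVD; the toy matrix, the rank cutoff k=2, and the 0.9 similarity threshold are all illustrative assumptions:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
terms = ["anemia", "blood", "deficiency", "analgesic", "painkiller", "dose"]
A = np.array([
    [2., 1., 0., 0.],   # anemia
    [1., 2., 0., 0.],   # blood
    [1., 2., 0., 0.],   # deficiency
    [0., 0., 2., 1.],   # analgesic
    [0., 0., 1., 2.],   # painkiller
    [0., 0., 1., 1.],   # dose
])

# The SVD itself (Mahout's SSVD plays this role at scale).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Zero out low-value singular values to reduce noise.
k = 2          # illustrative rank cutoff
s[k:] = 0.0

# Term vectors in the denoised space: rows of U scaled by the
# (truncated) singular values.
term_vecs = U * s

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Pairwise cosine similarity between terms; formerly-orthogonal terms
# that share contexts now move closer together.
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        sim = cosine(term_vecs[i], term_vecs[j])
        if sim > 0.9:    # illustrative threshold
            print(f"{terms[i]} ~ {terms[j]}: {sim:.2f}")
```

Pairs that survive the threshold are synonym candidates only; deciding which member of a pair is the “common” word (e.g., by corpus frequency, as in the ranking above) is a separate step.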

    For the rest of you, here’s a project I did last summer with singular-value decomposition. It is a different application of SVD to text processing, but may still help you understand the tool and its benefits.
