Question Answering and Summarization

  • Updated: 2018-09-20

Information Retrieval [J&M 23.1]

  • given a collection of documents and a user query, produce a ranked list of documents relevant to the query
    • this is a subproblem of web retrieval, which also takes links into account
  • main approach is bag of words:  treat each document and the query as a bag (multiset) of terms (words)
  • characterize each document and the query by a term vector:  a vector whose i-th element is the frequency of the i-th term (the "term frequency" or tf)
  • compute similarity of document and query using normalized dot product and rank documents based on their similarity
  • can weight i-th element by inverse document frequency of i-th term:
          idf[i] = log(number of documents in collection / number of documents containing the i-th term)
    this weights rare terms more heavily
  • for searches in limited collections, can expand query terms using hand-built network (e.g., WordNet) or corpus-derived term similarity
  • evaluate by recall/precision curve
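The ranking scheme above can be sketched in a few lines. This is a minimal illustration, not a production retrieval system: documents and the query are tokenized by naive whitespace splitting, tf-idf vectors are built as the notes describe, and documents are ranked by the normalized dot product (cosine similarity) with the query vector.

```python
import math
from collections import Counter

def tfidf_rank(docs, query):
    """Rank documents by cosine similarity of tf-idf vectors to the query."""
    n = len(docs)
    doc_tfs = [Counter(d.lower().split()) for d in docs]

    # idf[t] = log(N / number of documents containing t)
    df = Counter()
    for tf in doc_tfs:
        df.update(tf.keys())
    idf = {t: math.log(n / df[t]) for t in df}

    def vec(tf):
        # weight each term frequency by its inverse document frequency
        return {t: f * idf.get(t, 0.0) for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    qv = vec(Counter(query.lower().split()))
    scores = [(cosine(qv, vec(tf)), i) for i, tf in enumerate(doc_tfs)]
    return sorted(scores, reverse=True)  # list of (score, doc index)
```

Note that a query term occurring in every document gets idf 0 and so contributes nothing to the ranking, which is exactly the "weights rare terms more heavily" behavior described above.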

Factoid Question Answering [J&M 23.2]

  • finding in a document collection (or the Web) the specific answer to a question (not just a document containing the answer)
  • in principle we should be able to do this by transforming the question and the documents to a semantic representation and looking for a match at the semantic level
    • but this is computationally costly for very large collections (or the Web)
    • and more critically will often fail to find a match because of differences in how the question and answer are formulated ... differences which are not bridged by current semantic analysis
    • so most current QA uses a more robust mix of IR and shallow linguistic analysis
  • QA begins with question analysis, which aims to determine what type of answer is required
    • use a rich hierarchy of answer types ... typically > 100 types and subtypes
    • develop an extended NE tagger for these types
    • develop a classifier which maps questions into answer types, either by hand (writing patterns using shallow parsing) or by training from a set of questions and answer types
  • QA then proceeds in 3 steps:
    1. document retrieval (using IR techniques)
    2. passage retrieval ... selecting passages or sentences in the retrieved documents which are likely to contain the answer, based on the presence of individual keywords or sequences of keywords from the question as well as NEs of the type identified by question analysis
    3. answer extraction, using either the extended NE tagger or (for some answer types) specific patterns.  These are similar to relation extraction patterns, and can be written by hand or learned by bootstrapping.
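The hand-written variant of the question classifier mentioned above can be as simple as a list of regular-expression patterns mapped to answer types. The patterns and type labels below are illustrative assumptions, not a real system's inventory (real systems use a much richer hierarchy with > 100 types and subtypes).

```python
import re

# Hand-written patterns mapping question forms to (hypothetical) answer types,
# checked in order; earlier patterns take precedence.
ANSWER_TYPE_PATTERNS = [
    (r"^who\b", "PERSON"),
    (r"^where\b", "LOCATION"),
    (r"^when\b|what year|what date", "DATE"),
    (r"^how many|how much", "QUANTITY"),
    (r"^what city", "CITY"),
]

def answer_type(question):
    """Map a question to an answer type via shallow pattern matching."""
    q = question.lower().strip()
    for pattern, atype in ANSWER_TYPE_PATTERNS:
        if re.search(pattern, q):
            return atype
    return "OTHER"
```

The predicted answer type then guides passage retrieval and answer extraction: only candidate strings tagged with a matching NE type are considered as answers.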

Summarization [J&M 23.3]

Types of summarization

  • single document vs. multi-document
  • generic vs focused
  • extract vs abstract

Generic single-document summarization

  • content selection:  picking the important sentences
    • generate a topic signature:  set of important terms
      • for example, weight each term by tf*idf
      • weight each sentence by average term weight
      • rank sentences by weight and select top sentences
    • other features to use in ranking sentences
      • position (e.g., in news articles)
      • cue phrases
    • train content selection classifier from corpus of documents and their abstracts
  • ordering:  typically use original sentence order
  • sentence simplification
    • generally based on parse tree
    • delete selected constituents (appositives, attribution clauses, PPs)
    • measure 'information content' of constituents
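The content-selection step above (weight sentences by average tf*idf, take the top sentences, keep original order) can be sketched as follows. The background document frequencies and corpus size are assumed inputs from some reference collection.

```python
import math
from collections import Counter

def select_sentences(sentences, background_df, n_background, k=3):
    """Score each sentence by the average tf*idf weight of its terms,
    then return the top-k sentences in their original document order.
    background_df maps term -> document frequency in a (hypothetical)
    background corpus of n_background documents."""
    tf = Counter(w for s in sentences for w in s.lower().split())

    def weight(term):
        df = background_df.get(term, 1)  # unseen terms treated as rare
        return tf[term] * math.log(n_background / df)

    scored = []
    for i, s in enumerate(sentences):
        terms = s.lower().split()
        scored.append((sum(weight(t) for t in terms) / len(terms), i))

    # take the k highest-weighted sentences, then restore original order
    top = sorted(sorted(scored, reverse=True)[:k], key=lambda pair: pair[1])
    return [sentences[i] for _, i in top]
```

Position and cue-phrase features would enter as extra terms in the per-sentence score, or as features of a trained content-selection classifier.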

Generic multi-document summarization

  • new issues:  redundancy and ordering
  • dealing with redundancy
    • MMR:  maximal marginal relevance
      • extract sentences incrementally
      • reduce each sentence's weight based on its maximum similarity to previously extracted sentences
    • clustering
      • compute similarity of all sentence pairs
      • cluster sentences based on similarity
      • select representative sentence from each cluster
  • ordering
    • aim to order sentences to produce a coherent summary, e.g., by placing sentences with high lexical overlap next to each other
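The MMR selection loop described above can be sketched directly. Relevance scores and pairwise sentence similarities are assumed to be precomputed (e.g., by the cosine measure from the IR section); lam is the standard trade-off parameter between relevance and redundancy.

```python
def mmr_select(sentences, query_sim, pair_sim, lam=0.7, k=3):
    """Maximal marginal relevance: extract sentences one at a time, scoring
    each remaining candidate by its relevance minus its maximum similarity
    to anything already selected.  query_sim[i] is sentence i's relevance
    to the query/topic; pair_sim[i][j] is similarity between sentences."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With lam near 1 this reduces to plain relevance ranking; lowering lam penalizes sentences that repeat already-selected content, which is the redundancy problem specific to the multi-document setting.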

Focused summarization

  • general approach:  use similarity to user query as a factor in sentence / snippet selection
  • task-specific approach (e.g., for biographical summaries):  build a document template for answers to a specific question type;  use extraction techniques to fill some slots
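The general approach above can be illustrated with a deliberately crude similarity measure: word overlap with the user query stands in for the vector-space similarity a real system would use, and the top sentences are returned in document order.

```python
def focused_select(sentences, query, k=2):
    """Query-focused selection: score each sentence by its word overlap
    with the user query (a crude stand-in for tf-idf cosine similarity)
    and return the top-k sentences in original document order."""
    q = set(query.lower().split())
    scored = [(len(q & set(s.lower().split())), i)
              for i, s in enumerate(sentences)]
    # take the k best-scoring sentences, then restore document order
    top = sorted(sorted(scored, key=lambda p: (-p[0], p[1]))[:k],
                 key=lambda p: p[1])
    return [sentences[i] for _, i in top]
```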
