27(2) / 2022/12 / pp. 1-30
Aligning Sentences in a Paragraph Paraphrased Corpus with New Embedding-based Similarity Measures
Authors
Aleksandra Smolka *
(Social Networks and Human Centered Computing, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica)
Hsin-Min Wang
(Institute of Information Science, Academia Sinica)
Jason S. Chang
(Department of Computer Science, National Tsing Hua University)
Keh-Yih Su
(Institute of Information Science, Academia Sinica)
Abstract
To better understand and utilize the lexical and syntactic mappings between different language expressions, it is often necessary to first perform sentence alignment on the given data. Until now, the character trigram overlapping ratio has been considered the best similarity measure on the text simplification corpus. In this paper, we aim to show that a newer embedding-based similarity metric is preferable to this traditional state-of-the-art (SOTA) metric on the paragraph-paraphrased corpus. We report a series of experiments designed to compare different alignment search strategies, as well as various embedding-based and non-embedding-based sentence similarity metrics, in the paraphrased sentence alignment task. Additionally, we explore the problem of aligning and extracting sentences under imposed restrictions, such as controlling sentence complexity. For evaluation, we use paragraph pairs sampled from the Webis-CPC-11 corpus, which contains paraphrased paragraphs. Our results indicate that modern embedding-based metrics, such as those utilizing SentenceBERT or BERTScore, significantly outperform the character trigram overlapping ratio in the sentence alignment task on the paragraph-paraphrased corpus.
Keywords
Sentence Alignment; Sentence Similarity; Sentence Embedding; Paragraph-paraphrased Corpus
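
Illustrative sketch
The abstract contrasts a character trigram overlapping ratio with embedding-based similarity metrics inside an alignment search. The Python sketch below is only a rough illustration of that setup, not the authors' implementation. It assumes the overlapping ratio is a Jaccard ratio over sets of character trigrams (the paper may define it differently), and it uses a simple greedy best-match search with a hypothetical threshold of 0.3. The names char_trigrams, trigram_overlap, and align are invented for this example.

```python
from typing import Callable, List, Set, Tuple


def char_trigrams(text: str) -> Set[str]:
    """Return the set of character trigrams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_overlap(a: str, b: str) -> float:
    """Assumed definition: Jaccard ratio of the two character-trigram sets."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(
    source: List[str],
    target: List[str],
    similarity: Callable[[str, str], float] = trigram_overlap,
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedy search: pair each source sentence with its best-scoring target sentence."""
    pairs = []
    if not target:
        return pairs
    for i, sent in enumerate(source):
        scores = [similarity(sent, t) for t in target]
        j = max(range(len(target)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    src = ["The cat sat on the mat."]
    tgt = ["A dog barked loudly.", "The cat sat on a mat."]
    for i, j, score in align(src, tgt):
        print(f"source {i} -> target {j} (score {score:.2f})")
```

Because align takes the scoring function as a parameter, an embedding-based scorer (for example, cosine similarity between sentence embeddings) could be passed in place of trigram_overlap to compare the two families of metrics under the same search strategy.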