
Title Generation and Keyphrase Extraction from Persian Scientific Texts
Mahdi Mohseni, Heshaam Faili
Abstract
Modern neural approaches, which usually rely on large volumes of training data, have achieved remarkable progress in various fields of text processing. However, these approaches have not been studied adequately for low-resource languages. In this paper, we focus on title generation and keyphrase extraction in the Persian language. We build a large corpus of scientific Persian texts, which enables us to train end-to-end neural models for generating titles and extracting keyphrases. We investigate the effect of input length on modeling Persian text in both tasks. Additionally, we compare subword-level processing with word-level processing and show that even a straightforward subword encoding method substantially improves results on Persian, an agglutinative language. For keyphrase extraction, we formulate the task in two different ways: training the model to output all keyphrases at once, and training it to output one keyphrase at a time and then extracting the n-best keyphrases during decoding. The latter greatly improves performance.
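To illustrate the kind of straightforward subword encoding the abstract refers to, the sketch below implements a minimal byte-pair-encoding (BPE) learner and segmenter. This is a generic toy illustration, not the paper's actual encoding method or corpus; the sample vocabulary and merge count are hypothetical. BPE-style segmentation is useful for agglutinative languages such as Persian because frequent stems and affixes surface as reusable subword units.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a word-frequency dict (word -> count).

    Returns the ordered list of learned merge pairs.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical toy corpus; a real system would learn merges from the
# scientific-text corpus described in the paper.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
pieces = segment("lowest", merges)  # unseen word split into known subwords
```

In production systems this role is typically filled by a trained tokenizer (e.g. SentencePiece), but the merge-and-replay loop above is the core idea.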