Title

A hybrid model based on term relation analysis and weighted frequent itemsets for multi-document summarization (parallel title: The combination of term relations analysis and weighted frequent itemset model for multi-document summarization)

Creator

Arash Chaghari

Subject

Classification

Library

University of Tabriz Library, Documentation and Publication Center

Location

Province: East Azarbaijan, City: Tabriz

Library contact: 04133294120, 04133294118

NATIONAL BIBLIOGRAPHY NUMBER

Number

۲۲۸۸۶پ

LANGUAGE OF THE ITEM

Language of Text, Soundtrack etc.

per

TITLE AND STATEMENT OF RESPONSIBILITY

Title Proper

A hybrid model based on term relation analysis and weighted frequent itemsets for multi-document summarization

Parallel Title Proper

The combination of term relations analysis and weighted frequent itemset model for multi-document summarization

First Statement of Responsibility

Arash Chaghari

PUBLICATION, DISTRIBUTION, ETC.

Name of Publisher, Distributor, etc.

Electrical and Computer Engineering

Date of Publication, Distribution, etc.

1398 (Solar Hijri)

Name of Manufacturer

Abbaspour

PHYSICAL DESCRIPTION

Specific Material Designation and Extent of Item

120 p.

NOTES PERTAINING TO PUBLICATION, DISTRIBUTION, ETC.

Text of Note

Print

DISSERTATION (THESIS) NOTE

Dissertation or thesis details and type of degree

Doctorate (PhD)

Discipline of degree

Computer Engineering

Date of degree

1398/11/14 (Solar Hijri; 3 February 2020)

Body granting the degree

Tabriz

SUMMARY OR ABSTRACT

Text of Note

An important property of a summarizer is that it preserves the important content of the original text, so that the key sentences carry over to the summary and unimportant material is left out. This dissertation presents a multi-document summarizer built on frequent itemsets, the Bernoulli distribution, statistical features, and a greedy approach. First, using the OpenNLP preprocessing tool, all documents are segmented into meaningful units, namely sentences. Stop words such as prepositions, articles, and pronouns are removed. Stemming is performed with the Porter algorithm. In addition, a part-of-speech tagger is used so that only nouns, verbs, and adjectives are processed. Four features are considered in this dissertation. The first feature computes term weights based on the valency model. Because the valency model computes item weights from frequent itemsets, the frequent itemsets are first identified with a frequent itemset mining algorithm such as FP-Growth. The local and global weights of the terms that make up each itemset are then computed, and the value of each sentence under this feature is determined from the resulting term weights. The second feature uses the Bernoulli distribution to measure the relations between terms and the amount of information gained from the co-occurrence of terms in the corpus. The third and fourth features are the statistical features of sentence length and sentence position in the document. Finally, a greedy approach uses the scores each sentence obtains from these four features to select the sentences that belong to the output summary. Among the available methods for evaluating document summarization, the ROUGE tool is used to evaluate the results. According to the ROUGE-2 measure on the DUC2004 dataset, the highest improvement rate is 61.19 percent and the lowest is 5 percent; for the ROUGE-4 measure on the same dataset, the highest improvement rate is 200 percent and the lowest is 5 percent. On the DUC2002 dataset, the highest improvement rate for the ROUGE-1 measure is 11 percent and the lowest is 3 percent; for the ROUGE-2 measure on the same dataset, the highest is 97 percent and the lowest is 2 percent. Finally, comparing the ROUGE results of the proposed method with those of other prominent summarization methods shows that the method presented in this dissertation is considerably superior to other multi-document summarization methods.
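The final step described in the abstract, greedy selection of sentences from four feature scores, might look roughly like the following sketch. The record does not say how the four scores are combined or what length limit is applied, so the weighted sum, the `weights` parameter, the `max_words` budget, and all names here are illustrative assumptions rather than the dissertation's method.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    itemset_score: float      # feature 1: weighted frequent itemsets
    association_score: float  # feature 2: term association (Bernoulli-based)
    length_score: float       # feature 3: sentence length
    position_score: float     # feature 4: position in the document


def combined_score(s, weights=(1.0, 1.0, 1.0, 1.0)):
    # Assumption: a plain weighted sum; the abstract does not give the formula.
    feats = (s.itemset_score, s.association_score, s.length_score, s.position_score)
    return sum(w * f for w, f in zip(weights, feats))


def greedy_select(sentences, max_words=100):
    """Take sentences in descending score order while they fit the word budget."""
    chosen, used = [], 0
    for s in sorted(sentences, key=combined_score, reverse=True):
        n_words = len(s.text.split())
        if used + n_words <= max_words:
            chosen.append(s)
            used += n_words
    return chosen
```

A sentence that does not fit the remaining budget is skipped rather than ending the loop, so a shorter, lower-scoring sentence can still be added afterwards.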

Text of Note

An important characteristic of a summarizer is that it preserves the important content of the original text and omits the rest. The present study proposes a multi-document summarizer based on frequent itemsets, the Bernoulli distribution, statistical features, and a greedy approach. For segmentation, all documents are divided into meaningful units (here, sentences) using OpenNLP, an open-source preprocessing tool. Stop words such as prepositions, articles, and pronouns carry little semantic content and are omitted, since they play no significant role in detecting the most important sentences in the text. The most commonly used English words, such as a, an, and the, which have little significance with respect to the document, are also removed. For this purpose, the stop-word list built into the SMART information retrieval system was used. In addition, analysis of the documents showed that some terms, such as a.m., p.m., fla., and edt., and all tokens with fewer than three letters make no significant contribution to the summarization process. Accordingly, these terms are omitted from the documents. Stemming is a procedure by which terms with the same stem or root are reduced to a common form by removing variable suffixes. Inspection of human-generated summaries showed that the majority of tokens in their sentences are nouns, verbs, or adjectives. Based on this observation, part-of-speech (POS) tagging is used to identify the nouns, verbs, and adjectives. In the present study, the amount of information between each pair of terms is measured by the Bernoulli model of randomness and used as a new feature. The study also addresses multi-document summarization based on frequent itemsets: the proposed method enriches frequent itemset mining by weighting the terms in the corpus.
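As a rough illustration of the token filtering and itemset mining described above, the sketch below treats each sentence as a transaction of filtered terms and counts term sets that occur in at least `min_support` sentences. It uses brute-force enumeration rather than FP-Growth, a tiny stand-in stop-word list rather than the SMART list, and omits stemming, POS filtering, and term weighting; every name and parameter is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption; not the SMART list).
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "was", "on"}
# Noise terms named in the abstract.
NOISE_TERMS = {"a.m.", "p.m.", "fla.", "edt."}


def filter_tokens(sentence):
    """Lowercase, drop noise terms, stop words, and tokens under three letters."""
    kept = set()
    for tok in sentence.lower().split():
        if tok in NOISE_TERMS:  # checked before punctuation is stripped
            continue
        tok = tok.strip(".,;:!?\"'()")
        if len(tok) >= 3 and tok not in STOP_WORDS:
            kept.add(tok)
    return kept


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every term set of up to max_size terms; keep those meeting min_support."""
    counts = Counter()
    for items in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(items), k):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}
```

Brute-force enumeration grows quickly with sentence length and `max_size`, which is the cost that algorithms such as FP-Growth are designed to avoid.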
The present study aims to generate a summary by using frequent itemsets, defining new features based on term association measures, and considering the weights of the terms. The proposed method has several advantages. First, no learning phase is needed. Second, it considers itemset features, term associations, and statistical features simultaneously. Third, it does not need additional resources such as an ontology to capture the correlation between terms. Finally, unlike recent itemset-based summarization methods, it considers the weights of the terms in the document collection, since terms are not equally important in a document. The quality of the generated summaries is evaluated using the official measures provided by the ROUGE toolkit (version 1.5.5). According to the ROUGE-2 measure on the DUC2004 dataset, the highest improvement rate is 61.19% and the lowest is 5%; for ROUGE-4 on the same dataset, the highest improvement rate is 200% and the lowest is 5%. On the DUC2002 dataset, the highest improvement rate for ROUGE-1 is 11% and the lowest is 3%; for ROUGE-2 on the same dataset, the highest is 97% and the lowest is 2%. Based on the ROUGE results on the DUC 2002 and DUC 2004 datasets, the proposed approach significantly outperforms state-of-the-art approaches.
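For readers unfamiliar with the measure, ROUGE-N recall is the share of reference n-grams that also appear in the candidate summary, with counts clipped. The sketch below computes it for a single reference with plain whitespace tokenization; it is not the ROUGE 1.5.5 toolkit, which also supports stemming, stop-word handling, and multiple references. The `improvement_rate` helper assumes the reported rates are relative percentage gains over a baseline score, which the abstract does not state explicitly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def improvement_rate(new_score, baseline_score):
    # Assumption: relative gain in percent over the baseline.
    return 100.0 * (new_score - baseline_score) / baseline_score
```

Under that assumption, a 200% improvement would mean the proposed method's score is three times the baseline score.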

PARALLEL TITLE PROPER

Parallel Title

The combination of term relations analysis and weighted frequent itemset model for multi-document summarization

PERSONAL NAME - PRIMARY RESPONSIBILITY

چاقری، آرش

Chaghari, Arash

ELECTRONIC LOCATION AND ACCESS

Public note

Black and white
