Title

The combination of term relations analysis and weighted frequent itemset model for multi-document summarization

Creator

Arash Chaghari

Subject

Classification

Library

Central Library and Center for Documents and Publications, University of Tabriz

Location

Province: East Azerbaijan, City: Tabriz

Library contact: 04133294120-04133294118

National bibliography number

Number

۲۲۸۸۶پ

Language of the work

Language of text, soundtrack, etc.

per

Title and statement of responsibility

Title proper

A hybrid model based on term relation analysis and weighted frequent itemsets for multi-document summarization

Parallel title

The combination of term relations analysis and weighted frequent itemset model for multi-document summarization

First statement of responsibility

Arash Chaghari

Publication, distribution, etc.

Name of publisher, distributor, etc.

Electrical and Computer Engineering

Date of publication, distribution, etc.

1398 (Solar Hijri calendar)

Name of manufacturer

Abbaspour (عباسپور)

Physical description

Specific material designation and extent

120 p.

Notes pertaining to publication, distribution, etc.

Text of note

Print

Dissertation notes

Dissertation details and degree type

Doctorate (PhD)

Discipline of degree

Computer Engineering

Date degree granted

1398/11/14 (Solar Hijri calendar)

Degree-granting institution

Tabriz

Summary or abstract notes

Text of note

An important property of a summarizer is that it preserves the important content of the original text, carrying the key sentences into the summary and discarding less important material. This dissertation proposes a multi-document summarizer that uses frequent itemsets, the Bernoulli distribution, statistical features, and a greedy approach. First, the OpenNLP preprocessing tool splits all documents into meaningful units called sentences. Stop words such as prepositions, articles, and pronouns are removed. Words are stemmed with the Porter algorithm. In addition, a part-of-speech tagger is used so that only nouns, verbs, and adjectives are processed. Four features are considered. The first feature computes term weights based on the Valency model. Because the Valency model computes item weights from frequent itemsets, the frequent itemsets are first identified with a frequent itemset mining algorithm such as FP-Growth. The local and global weights of the terms that make up each itemset are then computed, and the score of each sentence under this feature is derived from the resulting term weights. The second feature uses the Bernoulli distribution to compute the relations between terms and the amount of information gained from the co-occurrence of terms in the corpus. The third and fourth features are the statistical features of sentence length and sentence position in the document. Finally, the sentences of the output summary are selected with a greedy approach, using the score that each sentence obtains from the four features. Among the available methods for evaluating document summarization, the ROUGE toolkit is used to evaluate the results. On the ROUGE-2 measure for the DUC2004 dataset, the highest improvement rate is 61.19 percent and the lowest is 5 percent; on ROUGE-4 for the same dataset, the highest improvement rate is 200 percent and the lowest is 5 percent. For the DUC2002 dataset, the highest improvement rate on ROUGE-1 is 11 percent and the lowest is 3 percent; on ROUGE-2 for the same dataset, the highest is 97 percent and the lowest is 2 percent. Comparing the ROUGE results of the proposed method with those of other well-known summarization methods shows that the method proposed in this dissertation has a considerable advantage over other multi-document summarization methods.
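The final selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not say how the four feature scores are combined or what length limit is used, so the unweighted sum in `total` and the `max_words` budget are assumptions, and the names `Sentence` and `greedy_summary` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    text: str
    itemset_score: float      # feature 1: score from weighted frequent itemsets
    association_score: float  # feature 2: score from term co-occurrence information
    length_score: float       # feature 3: sentence length
    position_score: float     # feature 4: sentence position in its document

    def total(self) -> float:
        # Assumption: the four features are combined by an unweighted sum.
        return (self.itemset_score + self.association_score
                + self.length_score + self.position_score)


def greedy_summary(sentences: List[Sentence], max_words: int) -> List[Sentence]:
    """Pick sentences in descending score order while the word budget allows."""
    ranked = sorted(sentences, key=lambda s: s.total(), reverse=True)
    chosen: List[Sentence] = []
    used = 0
    for sentence in ranked:
        n_words = len(sentence.text.split())
        if used + n_words <= max_words:
            chosen.append(sentence)
            used += n_words
    return chosen
```

A sentence that does not fit the remaining budget is skipped rather than ending the loop, so a shorter, lower-scoring sentence can still fill the leftover space.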

Text of note

An important characteristic of a summarizer is that it preserves the important content of the original text and discards the rest. The present study proposes a multi-document summarizer based on frequent itemsets, the Bernoulli distribution, statistical features, and a greedy approach. For segmentation, all documents are divided into meaningful units (in this case, sentences) using OpenNLP, an open-source preprocessing tool. Stop words such as prepositions, articles, and pronouns carry little semantic content and play no significant role in detecting the most important sentences, so they are removed; this includes the most common English words such as a, an, and the. The stop word list used is the built-in list of the SMART information retrieval system. In addition, analysis of the documents shows that terms such as a.m., p.m., fla., and edt., as well as all tokens with fewer than three letters, contribute little to the summarization process, so they are also removed. Stemming reduces terms that share a stem or root to a common form by removing their variable suffixes. An examination of human-generated summaries shows that most tokens in their sentences are nouns, verbs, or adjectives. Based on this observation, a part-of-speech (POS) tagger is used to identify the nouns, verbs, and adjectives. The amount of information shared by each pair of terms is measured with the Bernoulli model of randomness and used as a new feature. The study also builds on multi-document summarization based on frequent itemsets: the proposed method enriches frequent itemset mining by weighting the terms in the corpus.
The present study aims to generate a summary by using frequent itemsets, defining new features based on term association measures, and considering the weights of the terms. The proposed method has several advantages. First, no learning phase is needed. Second, it considers itemset features, term associations, and statistical features simultaneously. Third, it needs no additional resources, such as an ontology, to account for the correlation between terms. Finally, unlike recent itemset-based summarization methods, it considers the weights of the terms in the document collection, since terms are not equally important in a document. The quality of the generated summaries is evaluated with the official measures provided by the ROUGE toolkit (version 1.5.5). On the ROUGE-2 measure for the DUC2004 dataset, the highest improvement rate is 61.19% and the lowest is 5%; on ROUGE-4 for the same dataset, the highest is 200% and the lowest is 5%. For the DUC2002 dataset, the highest improvement rate on ROUGE-1 is 11% and the lowest is 3%; on ROUGE-2 for the same dataset, the highest is 97% and the lowest is 2%. Based on the ROUGE results on DUC2002 and DUC2004, the proposed approach significantly outperforms state-of-the-art approaches.
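The token filtering described in this paragraph can be sketched in Python as follows. This is an illustration only: the regular-expression sentence splitter is a naive stand-in for OpenNLP, `STOP_WORDS` is a small hypothetical list rather than the SMART list, and stemming and POS tagging are left out because they need an external library. The function names are hypothetical.

```python
import re
from typing import List

# Assumption: a tiny illustrative stop list, NOT the SMART system's list.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "that", "this", "from"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter (stand-in for OpenNLP): break after ., ! or ? plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def filter_tokens(sentence: str) -> List[str]:
    """Lowercase, keep alphabetic runs, drop stop words and tokens under three letters."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]
```

Under this tokenization an abbreviation such as a.m. breaks into single-letter tokens, so the three-letter minimum removes it without a dedicated rule. The naive splitter would, however, break a sentence after such an abbreviation, which is one reason a trained sentence detector is preferable in practice.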
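For readers unfamiliar with the evaluation measures named here, the following Python sketch shows a simplified ROUGE-N recall and a relative improvement rate. It does not reproduce the ROUGE 1.5.5 toolkit, which offers further options such as stemming and stop word removal. The abstract does not define "improvement rate", so the relative-gain formula in `improvement_rate` is an assumption, and both function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def improvement_rate(proposed: float, baseline: float) -> float:
    """Assumed definition: relative gain over a baseline score, in percent."""
    return (proposed - baseline) / baseline * 100.0
```

Under this assumed definition, an improvement rate of 200% would mean the proposed method's score is three times the baseline score.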

Parallel title

Parallel title

The combination of term relations analysis and weighted frequent itemset model for multi-document summarization

Personal name as main entry (primary intellectual responsibility)

Personal name authority, unverified

چاقری، آرش

Personal name authority, unverified

Chaghari, Arash

Electronic location and access

General note

Black and white

Cataloging status

Previously indexed
