Extracting paragraphs from an HTML page using Regex

Thursday, 31 Ordibehesht 1394, 15:55
I want to extract the paragraphs of an HTML page using Regex.
Example:
HTML page:

<p>A <b>book</b> is a set of written, printed, illustrated, or blank sheets, made of <a href="/wiki/Ink" title="Ink">ink</a>, <a href="/wiki/Paper" title="Paper">paper</a>, <a href="/wiki/Parchment" title="Parchment">parchment</a>, or other materials, fastened together to hinge at one side. A single sheet within a book is a <a href="/wiki/Recto" title="Recto" class="mw-redirect">leaf</a>, and each side of a leaf is a <a href="/wiki/Page_(paper)" title="Page (paper)">page</a>. A set of text-filled or illustrated pages produced in electronic format is known as an electronic book, or <a href="/wiki/E-book" title="E-book">e-book</a>.</p>
<p>Books may also refer to works of literature, or a main division of such a work. In <a href="/wiki/Library_and_information_science" title="Library and information science">library and information science</a>, a book is called a <a href="/wiki/Monograph" title="Monograph">monograph</a>, to distinguish it from serial <a href="/wiki/Periodical" title="Periodical" class="mw-redirect">periodicals</a> such as <a href="/wiki/Magazine" title="Magazine">magazines</a>, <a href="/wiki/Academic_journal" title="Academic journal">journals</a> or <a href="/wiki/Newspaper" title="Newspaper">newspapers</a>. The body of all written works including books is <a href="/wiki/Literature" title="Literature">literature</a>. In <a href="/wiki/Novel" title="Novel">novels</a> and sometimes other types of books (for example, biographies), a book may be divided into several large sections, also called books (Book 1, Book 2, Book 3, and so on). An avid reader of books is a <a href="/wiki/Bibliophilia" title="Bibliophilia">bibliophile</a> or colloquially, <i>bookworm</i>.</p>
<p>A shop where <a href="/wiki/Bookselling" title="Bookselling">books are bought and sold</a> is a bookshop or bookstore. Books can also be borrowed from <a href="/wiki/Lending_library" title="Lending library">libraries</a>. <a href="/wiki/Google" title="Google">Google</a> has estimated that as of 2010, approximately 130,000,000 unique titles had been published.<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span>[</span>1<span>]</span></a></sup></p>
<div id="toc" class="toc">
<div id="toctitle">

Now I want to extract only the paragraphs, that is, the text between the <p> and </p> tags, with the extra tags such as <a> and <h2> stripped out.

The result should look something like this:

A book is a set of written, printed, illustrated, or blank sheets, made of ink (https://en.wikipedia.org/wiki/Ink), paper (https://en.wikipedia.org/wiki/Paper), parchment (https://en.wikipedia.org/wiki/Parchment), or other materials, fastened together to hinge at one side. A single sheet within a book is a leaf (https://en.wikipedia.org/wiki/Recto), and each side of a leaf is a page (https://en.wikipedia.org/wiki/Page_%28paper%29). A set of text-filled or illustrated pages produced in electronic format is known as an electronic book, or e-book (https://en.wikipedia.org/wiki/E-book)

Thursday, 31 Ordibehesht 1394, 22:51
Could someone please point me in the right direction?
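
One possible approach, as a minimal sketch in Python (the thread does not say which language is in use, so that choice is an assumption): match each <p>...</p> block with one pattern, then strip whatever tags remain inside it with a second pattern. The function name extract_paragraphs and the shortened sample string are made up for illustration. This sketch keeps only the visible text, so the link URLs shown in parentheses in the expected output above are not reproduced. Regex is not a real HTML parser, so this will likely break on nested or malformed markup.

```python
import re
from html import unescape

# Non-greedy match of everything between an opening <p ...> and the next </p>.
# DOTALL lets a paragraph span several lines; IGNORECASE also accepts <P>.
P_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)

# Any remaining tag such as <a ...>, </a>, <b>, <sup ...>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_paragraphs(html_text):
    """Return the plain text of every <p> element found in html_text."""
    paragraphs = []
    for inner in P_RE.findall(html_text):
        text = TAG_RE.sub("", inner)              # drop inline tags, keep their text
        text = unescape(text)                     # turn entities like &amp; into characters
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        if text:
            paragraphs.append(text)
    return paragraphs


# Shortened, hypothetical version of the HTML from the question.
sample = (
    '<p>A <b>book</b> is made of <a href="/wiki/Ink" title="Ink">ink</a>.</p>\n'
    '<div id="toc" class="toc">'
)
print(extract_paragraphs(sample))  # ['A book is made of ink.']
```

The same two patterns should carry over to the regex engine of most other languages, as long as it supports non-greedy quantifiers and a "dot matches newline" option.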