.. link: http://dgplug.org/summertraining/2013/posts/m0rin09ma3-planetparser-v2-20130714-163236.html .. description: .. tags: .. date: 2013/07/14 16:32:37 .. title: m0rin09ma3 planetparser v2 20130714-163236 .. slug: m0rin09ma3-planetparser-v2-20130714-163236 planetparser ============= Prerequisite ------------- I installed beautifulsoup4 and lxml modules for this assignment in my 'virt1' environment. .. code-block:: (virt1) $ yolk -l Python - 2.7.5 - active development (/usr/lib/python2.7/lib-dynload) beautifulsoup4 - 4.2.1 - active lxml - 3.2.1 - active pip - 1.3.1 - active setuptools - 0.6c11 - active wsgiref - 0.1.2 - active development (/usr/lib/python2.7) yolk - 0.4.3 - active This program will read `a web page`_ and output blog title and author. .. _a web page: http://planet.fedoraproject.org $ python planetparser.py A link to the `source code`_. .. _source code: https://github.com/m0rin09ma3/python-summer-training-2013/blob/master/planetparser/planetparser.py Sample output: --------------- .. code-block:: Renich Bon Ciric HowTo: Bind10 resolver @ Fedora 19 Fedora Indonesia FAD Klaten 2013; Menuju Klaten Go Online Tom 'spot' Callaway In Memory of Seth Vidal Explanation ------------ In the main function, retrieve data from URL and store them into a string. .. code-block:: python # fetch data s_url = 'http://planet.fedoraproject.org' try: f = urllib2.urlopen(s_url) except urllib2.URLError: print 'failed to open url', s_url else: html_doc = f.read() f.close() #print html_doc Using following filter conditions to retrieve blog title & author 1. extract data under
tag #. extract string contains '' .. code-block:: python # extract title & author tags_header = SoupStrainer(class_="blog-entry-header") soup = BeautifulSoup(html_doc, "lxml", parse_only=tags_header) #soup = BeautifulSoup(html_doc, "html.parser", parse_only=tags_header) #print soup for tag in soup.find_all('a', href=re.compile("http:")): print tag.string