.. link: http://dgplug.org/summertraining/2013/posts/jcaselles-planetparser-20130714-193927.html
.. description: 
.. tags: 
.. date: 2013/07/14 19:39:27
.. title: JCaselles planetparser 20130714-193927
.. slug: jcaselles-planetparser-20130714-193927

Assignment:
-----------

Write a script that prints the title and the author of every blog post published on http://planet.fedoraproject.org, making use of the virtualenv feature.

Solution:
---------

Virtual Environment Setup:
~~~~~~~~~~~~~~~~~~~~~~~~~~

As I already had python-virtualenv installed, I jumped directly to creating a new virtual environment::

    $ mkdir virtual
    $ cd virtual
    $ virtualenv virtual_planetparse
    The PYTHONDONTWRITEBYTECODE environment variable is not compatible with setuptools.
    Either use --distribute or unset PYTHONDONTWRITEBYTECODE.

The first problem arises. Setting PYTHONDONTWRITEBYTECODE="" works around it::

    $ PYTHONDONTWRITEBYTECODE="" virtualenv virtual_planetparse
    New python executable in virtual_planetparse/bin/python
    Installing setuptools............done.
    Installing pip...............done.
    $ source virtual_planetparse/bin/activate
    (virtual_planetparse) $ pip install beautifulsoup4
    [...]
    Successfully installed beautifulsoup4
    Cleaning up...
    (virtual_planetparse) $ pip install html5lib
    Successfully installed html5lib
    Cleaning up...

Code
~~~~

`Link to the code `_

It works as explained in the comments. It fetches the HTML of the site with urllib2.urlopen(), then parses it with BeautifulSoup and picks out the wanted tags with select(). The syntax used to select the desired tags is the following (a small standalone demo of the selector is included at the end of this post)::

    ".blog-entry-author > a"  # The "a" (link) tag inside the element whose class
                              # is "blog-entry-author" (the leading dot means class)

This is the whole code:

.. code:: python
    :number-lines: 1

    #!/usr/bin/env python

    # Assignment: Get the titles and authors of all the blogs fed
    #             at http://planet.fedoraproject.org.
    #
    # Student: Josep Caselles
    # Course: #dgplug Summer Training Course
    # Date: 14/07/2013

    from sys import exit
    from urllib2 import urlopen
    from bs4 import BeautifulSoup

    URL_CONSTANT = "http://planet.fedoraproject.org"


    def print_blog_info():
        """
        This method will use BeautifulSoup to parse the content of the given
        URL and extract the desired content from it.

        With the select() method from BeautifulSoup you can get all tags given
        their class, id, or any other attribute. For a complete reference, see
        http://tinyurl.com/nn4m7hg.

        Steps made:
        1- Fetch the whole HTML with urllib2's urlopen()
        2- "Soup" it with BeautifulSoup
        3- Select the desired tags' content
        4- Print accordingly
        """
        try:
            html_doc = urlopen(URL_CONSTANT)
        except:
            exit("\nError: Something is wrong with http://planet.fedoraproject.org"
                 " or your internet connection\n")

        html_souped = BeautifulSoup(html_doc)
        html_doc.close()

        z = 0
        for x, y in zip(html_souped.select(".blog-entry-author > a"),
                        html_souped.select(".blog-entry-title > a")):
            z += 1
            print """
            Blog Entry n. %.2i:
            -----------------
            Title: '%s'
            Author: %s
            """ % (z, y.string, x.string)


    if __name__ == "__main__":
        print_blog_info()
        exit(0)
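
To sanity-check the selector syntax outside the full script, here is a minimal sketch that runs select() on a tiny, made-up HTML snippet. The markup below is hypothetical, it only mimics the class names the script looks for on planet.fedoraproject.org:

.. code:: python

    #!/usr/bin/env python

    # Minimal demo of the ".blog-entry-author > a" selector on a made-up
    # HTML snippet (hypothetical markup; only the class names match the
    # ones used on the planet page).

    from bs4 import BeautifulSoup

    SNIPPET = """
    <div class="blog-entry-title"><a href="#">My first post</a></div>
    <div class="blog-entry-author"><a href="#">Jane Doe</a></div>
    """

    soup = BeautifulSoup(SNIPPET)

    # ".blog-entry-author > a" matches the "a" tag that is a direct child
    # of any element whose class is "blog-entry-author".
    print soup.select(".blog-entry-author > a")[0].string   # Jane Doe
    print soup.select(".blog-entry-title > a")[0].string    # My first post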