The assignment: write a script that prints the title and the author of every blog post on http://planet.fedoraproject.org, making use of the virtualenv feature.
As I had already installed python-virtualenv, I jumped directly to creating a new virtual environment:
$ mkdir virtual
$ cd virtual
$ virtualenv virtual_planetparse
The PYTHONDONTWRITEBYTECODE environment variable is not compatible with setuptools. Either use --distribute or unset PYTHONDONTWRITEBYTECODE.
The first problem arises. Setting PYTHONDONTWRITEBYTECODE="" for that one command works around it:
$ PYTHONDONTWRITEBYTECODE="" virtualenv virtual_planetparse
New python executable in virtual_planetparse/bin/python
Installing setuptools............done.
Installing pip...............done.
$ source virtual_planetparse/bin/activate
(virtual_planetparse) $ pip install beautifulsoup4
[...]
Successfully installed beautifulsoup4
Cleaning up...
(virtual_planetparse) $ pip install html5lib
Successfully installed html5lib
Cleaning up...
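Why does an empty value help? The Python interpreter only honours PYTHONDONTWRITEBYTECODE when it is set to a non-empty string, so an empty value behaves like leaving it unset for that one command. I am assuming virtualenv's own check follows the same logic, since the error goes away. The small sketch below starts a child interpreter with each value and reports its sys.dont_write_bytecode flag. The helper name child_dont_write_bytecode is made up for illustration.

```python
import os
import subprocess
import sys


def child_dont_write_bytecode(value):
    """Run a child interpreter with PYTHONDONTWRITEBYTECODE=value and
    return its sys.dont_write_bytecode flag as a bool."""
    env = dict(os.environ)                    # copy, so our own env is untouched
    env["PYTHONDONTWRITEBYTECODE"] = value
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.dont_write_bytecode)"],
        env=env,
    )
    return out.decode().strip() == "True"


# A non-empty value switches bytecode writing off in the child.
print(child_dont_write_bytecode("1"))
# An empty value is treated by the interpreter as if the variable were unset.
print(child_dont_write_bytecode(""))
```

Unsetting the variable in the shell before calling virtualenv should have the same effect, as the error message itself suggests.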
The script works as explained in its comments: it fetches the HTML of the site with urllib2.urlopen(), parses it with BeautifulSoup, and picks out the tags with select(). The syntax used to select the desired tags is the following:
".blog-entry-author > a" # The "a" (link) tags that are direct children of a tag with class "blog-entry-author" (the leading dot means class)
This is the whole code:
#!/usr/bin/env python

# Assignment: Get the titles and authors of all the blogs fed
# at http://planet.fedoraproject.org.
#
# Student: Josep Caselles
# Course: #dgplug Summer Training Course
# Date: 14/07/2013

from sys import exit
from urllib2 import urlopen
from bs4 import BeautifulSoup

URL_CONSTANT = "http://planet.fedoraproject.org"

def print_blog_info():

    """
    This function will use BeautifulSoup to parse the content of the given url
    and extract from it the desired content. With the select() method from
    BeautifulSoup you can get all tags given their class, id, or any other
    attribute. For a complete reference, see http://tinyurl.com/nn4m7hg.

    Steps made:
    1- Fetch the whole html with urllib2 urlopen()
    2- "Soup" it with BeautifulSoup
    3- Select the desired tags' content
    4- Print accordingly

    """

    try:
        html_doc = urlopen(URL_CONSTANT)

    except Exception:
        exit("\nError: Something is wrong with http://planet.fedoraproject.org"
             " or your internet connection\n")

    html_souped = BeautifulSoup(html_doc)
    html_doc.close()

    z = 0

    # x is the author link, y is the title link of the same entry
    for x, y in zip(html_souped.select(".blog-entry-author > a"),
                    html_souped.select(".blog-entry-title > a")):

        z += 1

        print """
        Blog Entry n. %.2i:
        -----------------

        Title: '%s'
        Author: %s
        """ % (z, y.string, x.string)


if __name__ == "__main__":
    print_blog_info()
    exit(0)
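The script above is Python 2 (urllib2 and the print statement do not exist in Python 3). For anyone on Python 3, here is a rough sketch of the same idea, not a drop-in replacement. It assumes urllib.request.urlopen as the fetcher and the built-in "html.parser" backend, and it moves the parsing into a hypothetical helper, extract_entries(), so it can be tried on a hand-written snippet without touching the network. The sample markup is invented; the real page may be structured differently.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

from bs4 import BeautifulSoup

URL = "http://planet.fedoraproject.org"


def extract_entries(html):
    """Return a list of (title, author) tuples found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    authors = soup.select(".blog-entry-author > a")
    titles = soup.select(".blog-entry-title > a")
    # zip() pairs the n-th author with the n-th title, as in the original.
    return [(t.get_text(strip=True), a.get_text(strip=True))
            for a, t in zip(authors, titles)]


def main():
    """Fetch the live page and print one line per entry (needs network)."""
    with urlopen(URL) as response:
        html = response.read()
    for n, (title, author) in enumerate(extract_entries(html), start=1):
        print("Blog Entry n. %02d: '%s' by %s" % (n, title, author))


# Hypothetical markup, invented so the parser can be tried offline.
SAMPLE = (
    '<h3 class="blog-entry-title"><a href="#">Post one</a></h3>'
    '<div class="blog-entry-author"><a href="#">Alice</a></div>'
)
print(extract_entries(SAMPLE))
```

Calling main() runs it against the live site. Keeping the fetch and the parsing apart also makes it easier to notice when the page layout changes and the selectors stop matching.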