JCaselles planetparser 20130714-193927

Posted: 2013-07-14 19:39

Assignment:

Write a script that prints the title and the author of every blog post aggregated at http://planet.fedoraproject.org, working inside a virtualenv.

Solution:

Virtual Environment Setup:

As I had already installed python-virtualenv, I jumped directly to creating a new virtual environment:

$ mkdir virtual
$ cd virtual
$ virtualenv virtual_planetparse
The PYTHONDONTWRITEBYTECODE environment variable is not compatible with setuptools. Either use --distribute or unset PYTHONDONTWRITEBYTECODE.

The first problem arises here. Setting PYTHONDONTWRITEBYTECODE="" for that one command works around it:

$ PYTHONDONTWRITEBYTECODE="" virtualenv virtual_planetparse
New python executable in virtual_planetparse/bin/python
Installing setuptools............done.
Installing pip...............done.
$ source virtual_planetparse/bin/activate
(virtual_planetparse) $ pip install beautifulsoup4
[...]
Successfully installed beautifulsoup4
Cleaning up...
(virtual_planetparse) $ pip install html5lib
Successfully installed html5lib
Cleaning up...
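
Before running the script, it can be handy to confirm that the interpreter really is the one from the virtual environment. The snippet below is only a minimal sketch of such a check. The helper name in_virtualenv is my own, and it assumes the usual conventions: older virtualenv releases set sys.real_prefix, while the venv module and newer virtualenv releases make sys.base_prefix differ from sys.prefix.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases record the system prefix in sys.real_prefix.
    # The stdlib venv module (and newer virtualenv) use sys.base_prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


print(in_virtualenv())
```

Run with the environment activated, this should print True. Run with the system python, it should print False.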

Code

Link to the code

It works as explained in the comments. It gets the HTML of the site with urllib2.urlopen(), then parses it with BeautifulSoup and queries the result with select(). The syntax used to select the desired tags is the following:

".blog-entry-author > a" # The "a" (link) tags that are direct children of a tag with class "blog-entry-author" (the leading dot means class)

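To see what the ">" in that selector does, here is a small sketch run against a made-up HTML fragment (the SAMPLE markup is hypothetical, not copied from the planet page). It uses Python 3 and the built-in "html.parser" backend so that it runs without html5lib. With ">", only links that are direct children of the class match. With a plain space, links nested at any depth match as well.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child link, one nested link, one id instead of class.
SAMPLE = """
<div class="blog-entry-author"><a href="#">Alice</a></div>
<div class="blog-entry-author"><span><a href="#">nested</a></span></div>
<div id="blog-entry-author"><a href="#">by id</a></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# ">" is the child combinator: the link must sit directly inside the class.
direct = [a.get_text() for a in soup.select(".blog-entry-author > a")]

# A space is the descendant combinator: the link may sit at any depth.
descendant = [a.get_text() for a in soup.select(".blog-entry-author a")]

print(direct)
print(descendant)
```

The third div is never matched because ".blog-entry-author" looks at the class attribute, not the id (that would be "#blog-entry-author").
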
This is the whole code:

#!/usr/bin/env python

# Assignment: Get the titles and authors of all the blogs aggregated
# at http://planet.fedoraproject.org.
#
# Student: Josep Caselles
# Course: #dgplug Summer Training Course
# Date: 14/07/2013

from sys import exit
from urllib2 import urlopen
from bs4 import BeautifulSoup

URL_CONSTANT = "http://planet.fedoraproject.org"

def print_blog_info():

    """
    Use BeautifulSoup to parse the content of URL_CONSTANT and extract
    the desired content from it. With the select() method from
    BeautifulSoup you can get all tags given their class, id, or any other
    attribute. For a complete reference, see http://tinyurl.com/nn4m7hg.

    Steps:
        1- Fetch the whole html with urllib2 urlopen()
        2- Parse ("soup") it with BeautifulSoup
        3- Select the desired tags' content
        4- Print accordingly

    """

    # urllib2.URLError is a subclass of IOError, so catching IOError covers
    # network failures without swallowing KeyboardInterrupt or SystemExit.
    try:
        html_doc = urlopen(URL_CONSTANT)

    except IOError:
        exit("\nError: Something is wrong with http://planet.fedoraproject.org"
             " or your internet connection\n")

    # html5lib was installed with pip earlier, so name the parser explicitly.
    html_souped = BeautifulSoup(html_doc, "html5lib")
    html_doc.close()

    authors = html_souped.select(".blog-entry-author > a")
    titles = html_souped.select(".blog-entry-title > a")

    # Pair each author link with its title link, numbering entries from 1.
    for number, (author, title) in enumerate(zip(authors, titles), 1):

        print """
Blog Entry n. %.2i:
-----------------

Title: '%s'
Author: %s
        """ % (number, title.string, author.string)


if __name__ == "__main__":
    print_blog_info()
    exit(0)
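
The script above targets Python 2 (urllib2 and the print statement). For anyone on Python 3, the following is a rough sketch of the same four steps, not a tested replacement. It assumes the page still uses the blog-entry-author and blog-entry-title classes. The function names fetch_html, extract_entries and format_entries are my own, and the parser is swapped to the built-in "html.parser". Splitting fetching from parsing also makes it possible to try the parsing step on a plain string without a network connection.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup

URL = "http://planet.fedoraproject.org"


def fetch_html(url=URL):
    """Return the raw page bytes, or exit with a readable message."""
    try:
        with urlopen(url) as response:
            return response.read()
    except URLError as err:
        raise SystemExit("Error: could not fetch %s (%s)" % (url, err))


def extract_entries(html):
    """Return a list of (title, author) pairs found in the given HTML."""
    # "html.parser" ships with Python; "html5lib" also works if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    authors = soup.select(".blog-entry-author > a")
    titles = soup.select(".blog-entry-title > a")
    return [(title.get_text(strip=True), author.get_text(strip=True))
            for author, title in zip(authors, titles)]


def format_entries(entries):
    """Number the entries from 1 and render them as text."""
    return "\n".join(
        "Blog Entry n. %02d:\n-----------------\n\nTitle: '%s'\nAuthor: %s\n"
        % (number, title, author)
        for number, (title, author) in enumerate(entries, 1)
    )
```

To print the live page, the three pieces would be chained as print(format_entries(extract_entries(fetch_html()))).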