.. link: http://dgplug.org/summertraining/2013/posts/thyarmageddon-planet-parser-20130715-082233.html
.. description:
.. tags:
.. date: 2013/07/15 08:22:33
.. title: ThyArmageddon Planet Parser 20130715-082233
.. slug: thyarmageddon-planet-parser-20130715-082233

Planet Parser
-------------

This script parses `Planet Fedora`_ and prints each post's title, author, link, and content to the terminal in a human-readable format. The script is available at this link_.

Setup
-----

The first step is to create a *virtual environment* and install the required module. This script needs *BeautifulSoup*.


.. code::

    $ virtualenv pparser
    New python executable in pparser/bin/python2.7
    Also creating executable in pparser/bin/python
    Installing setuptools............done.
    Installing pip...............done.
    $ source pparser/bin/activate
    (pparser)$ pip install beautifulsoup4
    Downloading/unpacking beautifulsoup4
      Downloading beautifulsoup4-4.2.1.tar.gz (139Kb): 139Kb downloaded
      Running setup.py egg_info for package beautifulsoup4
    Installing collected packages: beautifulsoup4
      Running setup.py install for beautifulsoup4
    Successfully installed beautifulsoup4
    Cleaning up...

The Code
--------
.. code:: python

    #!/usr/bin/env python

    """
    planetparser is a script that parses the information on
    http://planet.fedoraproject.org/ and prints the
    post title, the author, the link to the original
    post and the post itself to the terminal
    """

    from urllib import urlopen
    from sys import exit, argv
    from bs4 import BeautifulSoup
    import re


    def ParseAuthor(link):
        """
        In here, we use a regex to find and output the names of the authors
        on the whole page. This will return a list of the names
        """
        PatternAuthor = re.compile('
        ')
        return re.findall(PatternAuthor, link)


    def ParsePostTitle(link):
        """
        In here, we use a regex to find and output the post titles
        on the whole page. This will return a list of the titles
        """
        PatternPostTitle = re.compile('
        ')
        return re.findall(PatternPostTitle, link)


    def ParseLink(link):
        """
        In here, we use a regex to find and output the post links
        on the whole page. This will return a list of the links
        """
        PatternLink = re.compile('
        ')
        return re.findall(PatternLink, link)


    def ParsePost(link):
        """
        This function uses BeautifulSoup to find the content of the
        posts and will return the list of posts in html unchanged
        """
        Soup = BeautifulSoup(link)
        Posts = Soup.findAll(attrs={"class": "blog-entry-content"})
        return Posts


    def PrintList(ListAuthor, ListPostTitle, ListLink, NoPost, ListPost=''):
        """
        This function will print out the information given to it in lists
        in a formatted way to the terminal
        """
        print ""
        print "Fedora Planet"
        print "-------------\n"
        for i in range(len(ListAuthor)):
            print "Author: %s" % ListAuthor[i]
            print "Post Title: %s" % ListPostTitle[i]
            print "Link: %s" % ListLink[i]
            if NoPost == 0:
                print "-" * (len(ListLink[i]) + 6)
                print "\n"
                # We use .text to get only the text; strip html tags
                print "\t%s" % ListPost[i].text
                print "\n"
                print "*" * 100
                print "\n"


    if __name__ == '__main__':
        """
        The first thing we need to do is open the url and read it
        We'll raise an exception if this doesn't work for some reason
        and we'll exit the script
        """
        NoPost = 0
        if len(argv) > 2:
            print "Too many arguments"
            print "Please use -h or --help for further help"
            exit(1)

        if len(argv) == 2:
            if argv[1] == '-h' or argv[1] == '--help':
                print "Usage: ./planetparser.py [OPTIONS]"
                print "Parses Planet Fedora and outputs information from the page.\n"
                print "Mandatory arguments"
                print "-h, --help\t\tprint this help page"
                print "-n, --no-post\t\tdo not print posts"
                exit(1)
            elif argv[1] == '-n' or argv[1] == '--no-post':
                NoPost = 1
            else:
                print "Wrong arguments"
                print "Please use -h or --help for further help"
                exit(1)

        try:
            link = urlopen("http://planet.fedoraproject.org/").read()
        except IOError:
            print "Could not connect to website"
            print "Please check your connection and try again"
            exit(1)

        # Get the list of authors
        ListAuthor = ParseAuthor(link)
        # Get the list of post titles
        ListPostTitle = ParsePostTitle(link)
        # Get the list of the links
        ListLink = ParseLink(link)

        """
        If the user does not want to display the posts
        Don't bother to parse them
        """
        if NoPost == 0:
            # Get the posts posted on the page
            ListPost = ParsePost(link)
            # Print the output in a formatted manner
            PrintList(ListAuthor, ListPostTitle, ListLink, NoPost, ListPost)
        else:
            PrintList(ListAuthor, ListPostTitle, ListLink, NoPost)

        exit(0)

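
Each of the three regex-based functions is a single ``re.findall`` call over the page source, so the technique can be tried on a small HTML string without fetching the page. The following is a minimal sketch for Python 3, not the script's actual code. The sample markup, the class names ``entry-author`` and ``entry-title``, and the function names are assumptions made for illustration; they do not reflect Planet Fedora's real markup or the script's own patterns.

.. code:: python

    import re

    # Hypothetical markup for illustration only; NOT Planet Fedora's real HTML.
    SAMPLE_PAGE = """
    <div class="entry-author">Alice</div>
    <h3 class="entry-title"><a href="http://example.com/one">First post</a></h3>
    <div class="entry-author">Bob</div>
    <h3 class="entry-title"><a href="http://example.com/two">Second post</a></h3>
    """

    # Non-greedy groups (.*?) stop at the first closing tag that follows.
    PATTERN_AUTHOR = re.compile(r'<div class="entry-author">(.*?)</div>')
    PATTERN_LINK_TITLE = re.compile(
        r'<h3 class="entry-title"><a href="(.*?)">(.*?)</a></h3>'
    )


    def parse_authors(page):
        """Return every author name found in the page, in document order."""
        return PATTERN_AUTHOR.findall(page)


    def parse_links_and_titles(page):
        """Return (link, title) tuples; findall yields one tuple per match."""
        return PATTERN_LINK_TITLE.findall(page)


    for author, (link, title) in zip(parse_authors(SAMPLE_PAGE),
                                     parse_links_and_titles(SAMPLE_PAGE)):
        print("Author: %s" % author)
        print("Post Title: %s" % title)
        print("Link: %s" % link)

Regex matching of this kind depends on the exact markup of the page, so any change to the site's HTML would require the patterns to be updated.
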
Usage and Examples
------------------
Help Page
"""""""""
.. code::

    (pparser)$ ./planetparser.py -h
    Usage: ./planetparser.py [OPTIONS]
    Parses Planet Fedora and outputs information from the page.

    Mandatory arguments
    -h, --help              print this help page
    -n, --no-post           do not print posts

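
The help page above is produced by hand-written ``print`` statements and manual checks on ``argv``. As an alternative sketch (not what the script does), Python's standard ``argparse`` module can generate a similar help page and reject unknown arguments on its own. The function name ``build_parser`` is hypothetical, and the example is written for Python 3.

.. code:: python

    import argparse


    def build_parser():
        """Build a parser offering the same two options as the script."""
        parser = argparse.ArgumentParser(
            prog="planetparser.py",
            description="Parses Planet Fedora and outputs information "
                        "from the page.",
        )
        # -h/--help is added automatically by argparse.
        parser.add_argument(
            "-n", "--no-post",
            action="store_true",
            help="do not print posts",
        )
        return parser


    # Parse an explicit argument list instead of sys.argv for demonstration.
    args = build_parser().parse_args(["--no-post"])
    print("Skip posts: %s" % args.no_post)

With this approach, ``-h`` prints the generated help and exits with status 0, while an unrecognized argument prints a usage error and exits with status 2.
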
Output Without Post
"""""""""""""""""""
.. code:: bash

    (pparser)$ ./planetparser.py -n
    Fedora Planet
    -------------
    Author: Onuralp SEZER
    Post Title: Fedora 19 With Google-authenticator login
    Link: http://thunderbirdtrr.blogspot.com/2013/07/fedora-19-with-google-authenticator.html
    Author: Neville A. Cross - YN1V
    Post Title: Alistando Fedora 19 Release Party Managua
    Link: http://www.taygon.com/?p=827
    Author: Ruth Suehle
    Post Title: How to run Pidora in QEMU
    Link: http://hobbyhobby.wordpress.com/2013/07/14/how-to-run-pidora-in-qemu/
    ...

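
``PrintList`` walks the three lists by index, so it raises an ``IndexError`` if a pattern matched fewer titles or links than authors. One possible alternative, sketched here for Python 3, pairs the lists with ``zip``, which stops at the shortest list instead. The function name ``format_entries`` is hypothetical; the sample values are taken from the output above.

.. code:: python

    def format_entries(authors, titles, links):
        """Return the listing as one string, one entry per matched triple."""
        lines = []
        # zip stops at the shortest list, so a missing match cannot
        # cause an out-of-range index.
        for author, title, link in zip(authors, titles, links):
            lines.append("Author: %s" % author)
            lines.append("Post Title: %s" % title)
            lines.append("Link: %s" % link)
        return "\n".join(lines)


    authors = ["Neville A. Cross - YN1V", "Ruth Suehle"]
    titles = ["Alistando Fedora 19 Release Party Managua",
              "How to run Pidora in QEMU"]
    links = ["http://www.taygon.com/?p=827",
             "http://hobbyhobby.wordpress.com/2013/07/14/how-to-run-pidora-in-qemu/"]

    print(format_entries(authors, titles, links))

The trade-off is that ``zip`` silently drops unmatched trailing items, so comparing the list lengths first may still be worthwhile.
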
Output with Post
""""""""""""""""
.. code::

    (pparser)$ ./planetparser.py
    Fedora Planet
    -------------
    Author: Onuralp SEZER
    Post Title: Fedora 19 With Google-authenticator login
    Link: http://thunderbirdtrr.blogspot.com/2013/07/fedora-19-with-google-authenticator.html
    -----------------------------------------------------------------------------------------
    Hello everyone ; Novadays I was thinking about how do I get more secure system on my Fedora 19. (...)
    ****************************************************************************************************
    Author: Neville A. Cross - YN1V
    Post Title: Alistando Fedora 19 Release Party Managua
    Link: http://www.taygon.com/?p=827
    ----------------------------------
    Una de las cosas que se espera de un lanzamiento de una nueva versión de Fedora son los discos. (...)
    ...

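
The post bodies above appear as plain text because the script prints each post through BeautifulSoup's ``.text``, which drops the HTML tags. To illustrate the idea without third-party dependencies, the sketch below approximates it in Python 3 with the standard library's ``html.parser.HTMLParser``. This is a swapped-in technique, not how BeautifulSoup works internally, and the names ``TextExtractor`` and ``strip_tags`` are hypothetical.

.. code:: python

    from html.parser import HTMLParser


    class TextExtractor(HTMLParser):
        """Collect only the text nodes of a document, ignoring all tags."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            # Called for text between tags; character references such as
            # &amp; are already decoded (convert_charrefs defaults to True).
            self.chunks.append(data)


    def strip_tags(html):
        """Return the text content of an HTML fragment."""
        parser = TextExtractor()
        parser.feed(html)
        parser.close()  # flush any buffered trailing text
        return "".join(parser.chunks)


    print(strip_tags("<p>Hello <b>world</b> &amp; all</p>"))
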
.. _Planet Fedora: http://planet.fedoraproject.org/
.. _link: https://raw.github.com/ThyArmageddon/dgplug/master/planetparser/planetparser.py