			   PyTvGrab Library
			   ================

by Chris Ottrey (ottrey at sourceforge dot net)
   Gustavo Sverzut Barbieri (gsbarbieri at sourceforge dot net)


1 - Introduction
################

PyTvGrab-lib is an XMLTV grabber library for Python.
It extracts information from a web page and outputs it in the XMLTV
format (version 0.5.15; see http://xmltv.sourceforge.net).

The library provides helpers to generate the XML and to work with
dates and times, an abstract grabber model, a powerful regular
expression tool, and an easy-to-customize HTML parser.
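
To give a feel for the target format, here is a minimal sketch that
builds a tiny XMLTV document with Python's standard
xml.etree.ElementTree. It does not use the library's own xmltv module,
whose interface is not described here. The function name build_xmltv,
the channel ID and the sample data are made up for illustration, and
the sketch assumes Python 3.

```python
import xml.etree.ElementTree as ET


def build_xmltv(channels, programmes):
    """Build an XMLTV document from plain Python data (illustrative only)."""
    tv = ET.Element("tv")
    # One <channel> element per (id, display name) pair.
    for channel_id, name in channels:
        channel = ET.SubElement(tv, "channel", id=channel_id)
        ET.SubElement(channel, "display-name").text = name
    # One <programme> element per program, linked to its channel by ID.
    for prog in programmes:
        programme = ET.SubElement(
            tv, "programme",
            start=prog["start"], stop=prog["stop"], channel=prog["channel"],
        )
        ET.SubElement(programme, "title").text = prog["title"]
    return ET.tostring(tv, encoding="unicode")


# Hypothetical sample data.
xml_text = build_xmltv(
    [("c1.example", "Channel One")],
    [{"start": "20240101180000 +0000", "stop": "20240101190000 +0000",
      "channel": "c1.example", "title": "Evening News"}],
)
print(xml_text)
```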





2 - Parser Models
#################

So far, our grabbers use two parser models:

 * The regular expression model, using the re2 module. This model
   parses an entire web page using one regular expression. If you are
   a master of regular expressions, you'll like this one, but be
   careful, since regexps are hard to read and maintain.

 * The customized parser model, using the customizedparser module.
   This model uses an HTML parser customized to parse broken HTML (it
   can close some tags for you). It can keep just the needed tags and
   attributes, embedding the contents of non-matching tags in their
   parent.

These models are covered in more detail below.




2.1 - Regular Expression Model
==============================

This model downloads the web page and parses it using one big regular
expression (regexp for short). With all data at hand it does the
filtering based on user configuration.


There is no secret other than creating the regexp.

   re2 module
   ----------
   This module makes this model possible. When you provide an
   expression with named groups, the matches are returned as a
   dict. When a group is matched more than once, it is returned as a
   list. So if your regular expression matches table rows that
   represent channels, you should get a list of results, one per
   row.

   Check re2 module documentation for usage info.
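
The idea of "named groups in, dicts and lists out" can be sketched
with Python's standard re module. This is not the re2 API, which is
not documented here. It only imitates the shape of the result by
collecting one dict per matched table row. The HTML snippet, the
group names and the function extract_rows are assumptions made up for
this example.

```python
import re

# Hypothetical page fragment: one table row per channel.
HTML = """
<table>
<tr><td class="chan">BBC One</td><td class="time">18:00</td></tr>
<tr><td class="chan">BBC Two</td><td class="time">18:30</td></tr>
</table>
"""

# Named groups become the keys of each result dict.
ROW_RE = re.compile(
    r'<tr><td class="chan">(?P<channel>[^<]+)</td>'
    r'<td class="time">(?P<time>[^<]+)</td></tr>'
)


def extract_rows(html):
    """Return a list with one dict of named groups per matching row."""
    return [match.groupdict() for match in ROW_RE.finditer(html)]


rows = extract_rows(HTML)
for row in rows:
    print(row["channel"], row["time"])
```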



Chris Ottrey is the maintainer of the guide using this model, so
contact him if you want to develop something similar.

Here's the diagram:

                                 +-prog----+
                                 | Option  |
                                 | Prog    |
                                 +---------+
                                      ^
                                      |
                                 +-grab---+
                                 | Grab   |
                                 | Grab_C |
                                 +--------+
                                      ^
                                      |
                                 +-tv_grab_xx_yyy--------------+
                                 | Grabber                     |
                                 |  o    o    o    +-mytypes-+ |
                                 |  |    |    +----| Date    | |
                                 |  |    |         | Time    | |
                                 |  |    |         +---------+ |
                                 |  |    |         +-re2-----+ |
 http://source.url.tv/guide      |  |    +---------| compile | |
 --------------------------------|--|--------------|>extract | |
                                 |  |              +---------+ |
                                 |  |              +-xmltv---+ |
  TV.xml                         |  +--------------| XMLTV   | |
  <------------------------------|-----------------|-__str__ | |
                                 |                 +---------+ |
                                 +-----------------------------+
 

+---------------------------------------------------------------------+
| MODULE                                                              |
|   CLASS                                                             |
+---------------------------------------------------------------------+
| unittest2 - conducts unit tests on modules                          |
|   UnitTestException - raises this when unit test fails              |
|   UnitTest          - displays unit test results                    |
|                                                                     |
| prog      - program template                                        |
|   Option            - command line option to a program              |
|   Prog              - a program                                     |
|                                                                     |
| re2       - extracts a hierarchical object from a recursively       |
|             matching regex                                          |
|                                                                     |
| datetime2 - creates types                                           |
|   Time              - a time type created from various string fmts  |
|   Date              - a date type created from various string fmts  |
|                                                                     |
| xmltv     - creates an xmltv file                                   |
|   XMLTV             - creates an xmltv output                       |
|                                                                     |
| grab      - template class for a tv_grabber                         |
|   Grab              - tv_grabber                                    |
|   Grab_C            - configurable tv_grabber                       |
+---------------------------------------------------------------------+





2.2 - Customized Parser Model
=============================

This model uses a customized HTML parser that can close some unclosed
tags (like <option>) and parse only selected tags and attributes,
discarding the unwanted ones. The result is a Tag structure
representing the document. You then traverse this structure and
extract the required fields to assemble the XMLTV structure.

   CustomizedParser
   ----------------
   This class eliminates unwanted tags by embedding the contents of
   unwanted child tags into their parent. Imagine you want just the
   <a> tag from the structure below:

      <a><b>test</b></a>

   CustomizedParser will turn it into <a>test</a>.

   Unwanted attributes are removed as well.
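
The filtering behaviour described above can be sketched on top of
Python's standard html.parser.HTMLParser. This is not the
CustomizedParser implementation: the names Tag, FilteringParser and
parse are assumptions for this example, and the sketch does not try
to reproduce the library's repair of unclosed tags such as <option>.

```python
from html.parser import HTMLParser


class Tag:
    """A kept tag: its name, kept attributes, and children (Tags or text)."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []


class FilteringParser(HTMLParser):
    """Keep only wanted tags and attributes; fold other content into the parent."""

    def __init__(self, wanted_tags, wanted_attrs=()):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.wanted_attrs = set(wanted_attrs)
        self.root = Tag("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag not in self.wanted_tags:
            return  # unwanted tag: its text lands in the current parent
        kept = {key: value for key, value in attrs if key in self.wanted_attrs}
        node = Tag(tag, kept)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in self.wanted_tags:
            return
        # Pop back to the matching open tag; wanted tags still open
        # inside it are closed along with it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html, wanted_tags, wanted_attrs=()):
    parser = FilteringParser(wanted_tags, wanted_attrs)
    parser.feed(html)
    parser.close()
    return parser.root


root = parse('<a href="/x" class="y"><b>test</b></a>', ["a"], ["href"])
link = root.children[0]
print(link.name, link.attrs, link.children)
```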


Gustavo Barbieri was the first to come up with this model, but other
grabbers have followed, so you can ask him or any other author who
uses this model.






3 - Abstract Grabbers
#####################

To write a grabber you must write the parser and a user interface. To
make life easier for both users and developers, we provide a common
class that keeps the user interface consistent across grabbers and
helps developers with this boring task. This class is called Grab.

If you want your grabber to support user configuration, so that it
grabs just a subset of the available channels, use the Grab_C class,
which inherits from Grab.

Grab gives you an interface for grabbing several days at once, an
offset so you can start grabbing in the future or the past, caching
support, and even verbosity control.
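
As a rough illustration of how a number of days and an offset combine,
the sketch below computes the list of dates a grabber would fetch. It
is not taken from the Grab class. The function name dates_to_grab and
its parameters are hypothetical.

```python
from datetime import date, timedelta


def dates_to_grab(days, offset=0, today=None):
    """Return `days` consecutive dates, starting `offset` days from today.

    A negative offset starts in the past, a positive one in the future.
    """
    start = (today or date.today()) + timedelta(days=offset)
    return [start + timedelta(days=i) for i in range(days)]


# Three days, starting tomorrow relative to a fixed reference date.
for day in dates_to_grab(3, offset=1, today=date(2024, 1, 1)):
    print(day.isoformat())
```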

Grab_C also provides an easy way to read and write config files. In
the future we plan to provide a common way to add and remove channels
to be grabbed, together with a way to change the output channel ID.
Today each grabber does its own work.




4 - How To Write a Grabber
##########################

Before starting you must know how your provider works, then figure out
what you have to do in order to get the channels, programs and program
information. After that you must study the provider's HTML and decide
which parser model you'll use.

Most grabbers target guides with a three-layer layout:
 1) a page listing every supported channel, with links to each
    channel's programs for a given day.
 2) pages with a channel's programs for a given day, with links to
    detailed program information.
 3) pages with detailed program information.
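
The three layers above translate into three nested steps. The sketch
below shows only that control flow. The function grab_day, the
injected fetch and parse_* callables, and the PAGES data are all
hypothetical, and the "pages" are already-parsed Python objects so the
example runs without network access.

```python
def grab_day(fetch, parse_channels, parse_programs, parse_details, index_url):
    """Walk the three layers: channel list, channel day, program details."""
    guide = {}
    # Layer 1: the index page lists channels and links to their day pages.
    for channel_name, channel_url in parse_channels(fetch(index_url)):
        programs = []
        # Layer 2: the channel page lists programs and links to details.
        for title, detail_url in parse_programs(fetch(channel_url)):
            # Layer 3: the detail page holds the program information.
            details = parse_details(fetch(detail_url))
            programs.append({"title": title, **details})
        guide[channel_name] = programs
    return guide


# Fake "site": each URL maps to data a real parser would have extracted.
PAGES = {
    "index": [("Channel One", "ch1")],
    "ch1": [("Evening News", "ch1/news")],
    "ch1/news": {"description": "Daily headlines."},
}


def identity(page):
    return page


guide = grab_day(PAGES.get, identity, identity, identity, "index")
print(guide)
```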

Generally the first page also provides a time window of the programs
currently showing, arranged as a matrix with channels on the vertical
axis and time on the horizontal axis, so that each (channel, time)
pair identifies a program.

The first idea that comes to mind is to parse several pages like that,
gluing the windows together to make a day. **This is often a bad
idea!** If your guide provides the three-layer design, you should use
it!

At the moment, the br_uol, nz_xtra, be_tlm and be_tvb grabbers use
this paradigm and it works well. You can use them as examples. In
fact, the Belgian grabbers were written based on the Brazilian one
without trouble. It is worth noting that the Belgian grabbers use
threads to speed up downloading.
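
How the Belgian grabbers use threads is not described here, so the
following is only a general sketch of downloading several pages in
parallel with Python 3's standard concurrent.futures module. The
function fetch_all and the fake_fetch stand-in are hypothetical. A
real grabber would pass a function that downloads the URL.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_workers=4):
    """Call fetch(url) for every URL on a thread pool; return {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real download so the example runs offline.
    return "<html>%s</html>" % url


pages = fetch_all(fake_fetch, ["ch1", "ch2", "ch3"])
print(sorted(pages))
```

Threads help here because the work is dominated by waiting on the
network, not by parsing.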
