Browsing the web with LotusScript
02/16/2006
Show-n-tell Thursday
Rocky and Bruce came up with a cool idea, and I thought I'd take part in it.

Lately, I have made several agents that read web pages and analyze the contents, and I know that I'm not the only one doing this. However, most people seem to use either Java or the MS COM object WinHttp.WinHttpRequest.5.1. I don't. I haven't yet learned how to use Java, and I found limitations in using the MS COM object. Instead, I use the built-in functionality of the Notes/Domino Web Navigator with LotusScript.

This is how I use this functionality, and hopefully, it can help someone else out there.

The Web Navigator is actually a very competent web browser, even if it doesn't reach the functionality of Firefox, or even Internet Explorer. From what I can tell, it seems to be a little bit slower than the MS COM object, but definitely easier to handle.

Prerequisites
  • Domino server needs to run the Web task
  • Database needs the forms HTMLForm and Cookie
  • Database needs the view (Cookies) | Cookies
If you plan to use cookies, you will need a real form with a number of fields on it, which can be copied from the personal web retriever template on local. Otherwise you can just create an empty form. I've never used cookies in this context, so I don't really know anything about how to use them.

If you want to display a web page in the client, the HTMLForm needs at least the Rich Text field Body, otherwise it can be empty.

Moving right along...
We need to create an agent to hold the LotusScript code, so here is an example of an agent. It doesn't do anything useful, but it shows what is needed to get the job done:
Option Public
Option Declare

Function CleanHTML(Byval s As String) As String
    Dim crlf As String
    Dim lf As String
    Dim sp As String
    Dim sp2 As String
    Dim pos As Long

    crlf = Chr$(13) + Chr$(10)
    lf = Chr$(10)
    sp = Chr$(32)
    sp2 = Chr$(32) + Chr$(32)

    'Find CR + LF and replace with Space
    pos = Instr(s, crlf)
    Do While pos > 0
        s = Left$(s, pos - 1) + sp + Right$(s, Len(s) - pos - 1)
        pos = Instr(s, crlf)
    Loop

    'Find LF and replace with Space
    pos = Instr(s, lf)
    Do While pos > 0
        s = Left$(s, pos - 1) + sp + Right$(s, Len(s) - pos)
        pos = Instr(s, lf)
    Loop

    CleanHTML = Fulltrim(s)
End Function

Sub Initialize
    Dim session As New NotesSession
    Dim database As NotesDatabase
    Dim webpage As NotesDocument
    Dim item As NotesItem
    Dim mime As NotesMimeEntity
    Dim url As String
    Dim html As String
    Dim lhtml As String 'Lower case HTML

    'Get current database
    Set database = session.CurrentDatabase

    'Make sure that we get HTML
    session.ConvertMime=False

    'Set the URL to use
    url = "http://www.ibm.com/"

    'Get the web page
    Set webpage = database.GetDocumentByURL(url, 1, 1,,,,,, False)

    If webpage.HttpStatus(0) = "200" Then
        'Get the HTML out of the web page
        Set item = webpage.GetFirstItem("Body")
        Set mime = item.GetMIMEEntity
        html = CleanHTML(mime.ContentAsText)
        lhtml = Lcase(html)

        'Continue to extract whatever information is needed from the HTML
    End If

    'Make sure we delete the created document when we are finished with it
    Call webpage.Remove(True)
End Sub
(Options)
I assume that everyone always uses Option Declare in all LotusScript, don't you?

Function CleanHTML
I made the CleanHTML function to remove any line breaks, which makes it easier to search the HTML. Since line breaks can be either CR + LF or just LF, I need to search for both and replace them with a space.
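
On Notes/Domino 6 and later, the same cleanup can probably be written more compactly with the built-in Replace function. This is only a sketch and not the code used in the agent: the name CleanHTMLShort is made up for the example, and it assumes that Replace returns a plain string when it is given a plain string instead of an array.

```lotusscript
Function CleanHTMLShort(Byval s As String) As String
    ' Hypothetical alternative to CleanHTML, assuming the LotusScript
    ' Replace function (Notes/Domino 6 and later) is available.
    Dim crlf As String
    Dim lf As String

    crlf = Chr$(13) + Chr$(10)
    lf = Chr$(10)

    ' Replace CR + LF first, so each pair becomes one space instead of
    ' leaving a stray CR behind after the LF pass
    s = Replace(s, crlf, Chr$(32))
    s = Replace(s, lf, Chr$(32))

    CleanHTMLShort = Fulltrim(s)
End Function
```

The order of the two calls matters for the same reason the original function handles CR + LF before LF.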

Sub Initialize
As for almost anything we do in LotusScript, we need a NotesSession, a NotesDatabase and a NotesDocument. In this case we also need a NotesItem and a NotesMimeEntity.

The line session.ConvertMime=False is crucial. If we miss that one, we won't get a single line of HTML, since Domino will convert the HTML into Notes Rich Text.

Set webpage = database.GetDocumentByURL(url, 1, 1,,,,,, False) is the line that actually gets the web page for us. There are lots of parameters to this function, but I ordinarily only use the ones listed here. They are:
  • url: URL$: String. The desired uniform resource locator (URL), for example, http://www.acme.com. You can enter a maximum string length of 15K.
  • 1: reload: Integer. Optional. Enter 1 (True) to reload the page from its Internet server. Enter 0 (False) to load the page from the Internet only if it is not already in the Web Navigator database. Enter 2 to reload the page only if it has been modified on its Internet server. The default value is 0.
  • 1: urllist: Integer. Optional. Web pages can contain URL links to other Web pages. You can specify whether to save the URLs in a field called URLLinksn in the Notes document. (The Web Navigator creates a new URLLinksn field each time the field size reaches 64K. For example, the first URLLinks field would be URLLinks1, the second would be URLLinks2, and so on.). Specify 1 (True) if you want to save the URLs in the URLLinksn field(s). Specify 0 (False) or omit this parameter if you do not want to save the URLs in the URLLinksn field(s). If you save the URLs, you can use them in agents; for example, you could create an agent that opens Web pages in the Web Navigator database and then loads all the Web pages saved in each of the URLLinksn field(s).
  • False: returnimmediately: Boolean. Optional. Specify True to return immediately and not wait for completion of the retrieval. Specify False or omit this parameter to wait for completion of the retrieval. If you specify True, GetDocumentByURL does not return the NotesDocument object representing the URL document. This parameter is useful for retrieving a URL document for offline storage purposes; in this case, you do not need the NotesDocument object and do not need to wait for completion of the operation. This parameter is ignored and False is forced if the database being opened is not local to the execution context.
There are also parameters for username and password, both for web sites and for proxy servers, that I haven't used here. For the use of these, consult the Domino Designer Help for the NotesDatabase object.
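
To keep the positional arguments readable, the numeric values described above could be given names. The constant names below are made up for this sketch; only the values 0, 1 and 2 come from the parameter descriptions. The five skipped arguments are the character set plus the user name and password parameters for web sites and proxy servers.

```lotusscript
' Hypothetical constant names; the values come from the parameter list above
Const RELOAD_IF_MISSING = 0     ' load only if not already in the database
Const RELOAD_ALWAYS = 1         ' always reload from the Internet server
Const RELOAD_IF_MODIFIED = 2    ' reload only if modified on the server
Const SAVE_LINKS_NO = 0         ' do not create URLLinksn fields
Const SAVE_LINKS_YES = 1        ' save links in URLLinks1, URLLinks2, ...

Sub Initialize
    Dim session As New NotesSession
    Dim database As NotesDatabase
    Dim webpage As NotesDocument

    Set database = session.CurrentDatabase
    session.ConvertMime = False

    ' Same call as in the agent above, but without collecting the links
    Set webpage = database.GetDocumentByURL("http://www.ibm.com/", _
    RELOAD_ALWAYS, SAVE_LINKS_NO,,,,,, False)

    ' Clean up the temporary document again
    Call webpage.Remove(True)
End Sub
```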

The beauty of this NotesDatabase function is that we will always get a NotesDocument out of it, whether a web page was found or not. So we don't have to check if we got a NotesDocument; we just check the HTTPStatus field for the text value "200", which indicates success. Any other value indicates a failure.
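
If the status check is needed in more than one place, it could be wrapped in a small helper. This is a minimal sketch; the function name PageLoaded is invented for the example, and the Nothing and HasItem tests are purely defensive.

```lotusscript
Function PageLoaded(webpage As NotesDocument) As Boolean
    ' Hypothetical helper: True only when the HTTPStatus field holds "200"
    PageLoaded = False

    ' Defensive checks only; GetDocumentByURL is expected to return a document
    If webpage Is Nothing Then Exit Function
    If Not webpage.HasItem("HTTPStatus") Then Exit Function

    ' GetItemValue returns an array, so look at the first element
    PageLoaded = (Cstr(webpage.GetItemValue("HTTPStatus")(0)) = "200")
End Function
```

In the agent, the test would then read If PageLoaded(webpage) Then.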

Get the NotesItem Body out of the NotesDocument. Get the NotesMimeEntity out of the NotesItem. Finally, get the ContentAsText property out of the NotesMimeEntity, and we are home free. I usually use two variables for the HTML: one to store the (cleaned) HTML, and one to store the lower case HTML. I haven't measured the actual time, but I have a feeling that the Instr function will go a little bit faster if it doesn't have to compare text to both upper and lower case. Also, remember that a web page, even HTML only, can be a very long string!

I often have as many as six positional variables (pos1 to pos6) for use with Instr to find what I'm looking for in the HTML. The hard part is searching for something that is unique, and that will always exist in the web page.
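
As an illustration of the positional approach, here is a sketch that pulls the page title out of the HTML with two positions. The function name ExtractTitle is made up for the example. It assumes a plain <title> tag without attributes, and that Lcase does not change the length of the string, so that positions found in the lower case copy are valid in the original.

```lotusscript
Function ExtractTitle(html As String, lhtml As String) As String
    ' Hypothetical helper: returns the text between <title> and </title>,
    ' or an empty string if either tag is missing.
    ' html is the cleaned HTML, lhtml is the same string in lower case.
    Dim pos1 As Long
    Dim pos2 As Long

    ExtractTitle = ""

    ' Search the lower case copy so that <TITLE> and <title> both match
    pos1 = Instr(lhtml, "<title>")
    If pos1 = 0 Then Exit Function

    ' Move to the first character after the opening tag
    pos1 = pos1 + Len("<title>")

    ' Look for the closing tag, starting from pos1
    pos2 = Instr(pos1, lhtml, "</title>")
    If pos2 = 0 Then Exit Function

    ' Cut the text out of the original HTML to keep its upper and lower case
    ExtractTitle = Fulltrim(Mid$(html, pos1, pos2 - pos1))
End Function
```

Both Exit Function lines matter: a page that lacks the marker text should give an empty result rather than a random slice of HTML.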

Before exiting the Initialize sub, we need to remember to delete the web page NotesDocument so it won't stay in the database forever. It's easy to overlook, and you'll soon start to wonder why the database is growing in size. If the web page is to be kept within the database, we should change the form to something else. I usually start a web browsing agent by cleaning out any HTMLForm documents it can find, in case an earlier run was terminated before deleting its document.
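
The cleanup at the start of an agent could look something like this. It is a sketch under the assumption that leftover pages can be found by their Form field; the sub name RemoveLeftoverPages is invented for the example.

```lotusscript
Sub RemoveLeftoverPages(database As NotesDatabase)
    ' Hypothetical helper: deletes every document that still uses HTMLForm
    Dim leftovers As NotesDocumentCollection

    ' Nothing = no cutoff date, 0 = no limit on the number of documents
    Set leftovers = database.Search({Form = "HTMLForm"}, Nothing, 0)

    If leftovers.Count > 0 Then
        ' True forces the deletion even if a document was modified meanwhile
        Call leftovers.RemoveAll(True)
    End If
End Sub
```

Call it right after Set database = session.CurrentDatabase, before any page is fetched. Do not use it in a database where downloaded pages are meant to be kept under the HTMLForm form.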

Now, HTML web pages are not the only thing that can be opened with this function. It can just as easily be any file type you can think of. If it isn't a text file, it will simply be stored as a file attachment in the document. If True is specified as the last parameter, the function won't wait for the returned web page, and you can have one agent initiate the download of content, and another one to process the documents later.
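
The two-agent idea could be sketched roughly as follows. Treat it as an outline only: the sub names are invented, and it assumes that the downloaded pages end up as documents using the HTMLForm form with the HTTPStatus and URL fields filled in. Remember that the last parameter is ignored if the database is not local to the execution context.

```lotusscript
' Agent 1 (hypothetical name): start the download and return at once
Sub StartDownload(database As NotesDatabase, url As String)
    ' With True as the last argument no NotesDocument is returned,
    ' so the result is not assigned to anything
    Call database.GetDocumentByURL(url, 1, 0,,,,,, True)
End Sub

' Agent 2 (hypothetical name): handle whatever has arrived so far
Sub ProcessDownloads(database As NotesDatabase)
    Dim pages As NotesDocumentCollection
    Dim doc As NotesDocument
    Dim nextDoc As NotesDocument

    ' Assumption: downloaded pages are stored with the HTMLForm form
    Set pages = database.Search({Form = "HTMLForm"}, Nothing, 0)

    Set doc = pages.GetFirstDocument
    Do Until doc Is Nothing
        ' Fetch the next document first, since doc may be deleted below
        Set nextDoc = pages.GetNextDocument(doc)

        If Cstr(doc.GetItemValue("HTTPStatus")(0)) = "200" Then
            Print "Ready: " & Cstr(doc.GetItemValue("URL")(0))
            ' ...extract what is needed here, then clean up
            Call doc.Remove(True)
        End If

        Set doc = nextDoc
    Loop
End Sub
```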

You need to check the HTTP or META headers? Don't worry, they are all stored within the NotesDocument. The field HTTPHeaders contains the names of all HTTP header fields, and the METAHeaders field contains the names of all META header fields. If you specify a 1 for the third parameter, you will also get all links collected in fields named URLLinks1, URLLinks2 and so on. There is also a field named $ImageList that contains the path to all images on the web page. The field URL contains the web address for the page.
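
A short sketch of how these fields could be read. The sub name ListPageDetails is invented for the example, and it assumes that HTTPHeaders and the URLLinks fields are ordinary multi-value text fields.

```lotusscript
Sub ListPageDetails(webpage As NotesDocument)
    ' Hypothetical helper: prints the header names and the collected links
    Dim headerNames As Variant
    Dim links As Variant
    Dim n As Integer

    ' Names of the HTTP header fields stored in the document
    If webpage.HasItem("HTTPHeaders") Then
        headerNames = webpage.GetItemValue("HTTPHeaders")
        Forall headerName In headerNames
            Print "HTTP header: " & Cstr(headerName)
        End Forall
    End If

    ' Walk through URLLinks1, URLLinks2, ... until a field is missing
    n = 1
    Do While webpage.HasItem("URLLinks" & Cstr(n))
        links = webpage.GetItemValue("URLLinks" & Cstr(n))
        Forall linkUrl In links
            Print "Link: " & Cstr(linkUrl)
        End Forall
        n = n + 1
    Loop
End Sub
```

The URLLinks fields only exist if the page was fetched with 1 as the third parameter.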

LotusScript converted to HTML using Format Your LotusScript provided by Joe Litton

Update: I made a tiny mistake when publishing the code, so I have updated it. The line If webpage.HttpStatus = "200" Then was changed to be If webpage.HttpStatus(0) = "200" Then.