12 July 2001
Contact: Daniel Hellerstein (danielh@crosslink.net)

CheckLink ver 1.13c: Create, display, traverse, and index a web-tree

Abstract:
  CheckLink is a multi-threaded, socket-aware utility used to create,
  verify, traverse, and index a web-tree, where "web-tree" is defined
  as all URLs (in-line images, anchors, etc.) that are referenced in a
  chosen HTML document, and in documents reachable from this document.

  CheckLink can be run as an SRE-http addon, or from an OS/2 command
  prompt.

-------------------
Contents:

   I.     Introduction
   I.a.   Quick Start
   I.b.   Web Tree? Does that make sense?
   II.    Installation
   II.a.  Installing as an SRE-http addon
   II.b.  Using CheckLink as a standalone program
   III.   CheckLink parameters
   III.a. A Note on How CHEKLINK displays results
   III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters
   IV.    CHEKLINK Options -- Create a Web Tree
   V.     CHEKLNK2 Options -- Display and Traverse a Web Tree
   VI.    CHEKINDX -- Create an Index of a Web Site
   VI.a.  CHEKINDX Options
   VI.b.  CHEKINDX edit mode
   VII.   CHEKRPT report writer -- Report information about a Web Tree
   VIII.  CHEKFIX "fix" busted URLs -- Note busted links in files that
          contain them
   IX.    Notes
   X.     Disclaimer

-------------------
I. Introduction

CheckLink is a robot that is used to create, verify, traverse and index
a web-tree. In other words, CheckLink will find and variously display
all the URLs (such as anchors and in-line images) that appear in a set
of HTML documents.

In particular, CheckLink will:

  ... given a "Starter-URL" provided by a client:

  a) use TCP/IP socket calls to obtain the contents of the HTML
     document (that this "Starter-URL" points to).
     Alternatively, in standalone mode you can use the
     FILE:///filename.ext syntax to read & process a file on your
     hard disk.
  b) find URLs referred to by this document (i.e.,

       cd \internet\cheklink
       D:\INTERNET\CHEKLINK>cheklink

When run in standalone mode, the i/o interface is somewhat primitive
(no mouse, no graphics), and the final output is HTML code -- it is
meant to be viewed with a browser. Otherwise, the results are the same
as when run as an SRE-http addon (it might even be a touch faster).

IMPORTANT NOTE: To use CheckLink as a standalone program, you MUST have
                REXXLIB.DLL. REXXLIB was a commercial package, which
                now seems to be in the public domain. Regardless, you
                can obtain a legal-to-use-with-CheckLink version of
                REXXLIB.DLL from:
                  http://www.srehttp.org/apps/cheklink/chekdll.zip

If you are running an OS/2 web server that understands CGI-BIN (most of
them do), then you should copy the CHEKLNK2.CMD and CHEKINDX.CMD files
to your CGI-BIN scripts directory. The output from CHEKLINK can be
instructed to include appropriate calls to CHEKLNK2. In addition, you
can use the CHEKLINK.HTM "front end" to invoke both of these utilities.

Thus, to use CheckLink in a non-SRE-http environment, you will:

  a) Run CHEKLINK.CMD, from an OS/2 command prompt, to generate the
     index of a web-tree, and to produce several tables of results.
     BE SURE TO SAY Yes when asked:
        "Use CGI-BIN to specify CHEKLNK2 (web traversal) links?"

  b) Invoke CHEKLNK2.CMD and CHEKINDX.CMD as CGI-BIN scripts.
     One way to do this is to ...
        Invoke CHEKLNK2.CMD or CHEKINDX.CMD from CHEKLINK.HTM --
        you'll need to make a few simple modifications to CHEKLINK.HTM
        (see CHEKLINK.HTM for the details).

Alternatively, you can run CHEKRPT.CMD as a standalone program. CHEKRPT
is not quite as powerful as CHEKLNK2, but it does have a number of nice
report writing features, and the HTML documents it produces give you a
limited amount of "web tree traversal" opportunities.

-------------------
III. CheckLink parameters.

Regardless of how you run CheckLink, you may wish to first adjust
several performance-tuning and display-customization parameters.
Most of these appear at the top of CHEKLINK.CMD, and there are a few in
CHEKLNK2.CMD, CHEKRPT.CMD, and CHEKINDX.CMD -- you should modify these
files with your favorite text editor.

Note that to use any of the CheckLink programs you do NOT need to set
these parameters -- the default values work reasonably well. However,
if you intend to make more than occasional use of CheckLink, we
recommend setting the LINKFILE_DIR parameter in CHEKLINK.CMD,
CHEKLNK2.CMD, CHEKRPT.CMD, and CHEKINDX.CMD.

-------------------
III.a. A Note on How CHEKLINK displays results

Before further discussion, a note on how CHEKLINK (the web-tree
creator) displays results (when run as an SRE-http addon) is germane:

   CHEKLINK can return results as one long document, as a "two part"
   document, or as two separate documents.

   In a "two part" document:
      The first part contains status information, and is sent to the
      client in pieces.
      The second part contains the results tables.

   In a "long document" these parts are concatenated -- the final
   output contains both "status" and "results" information (and will
   be a bit more cluttered as a result).

Since CHEKLINK can take several minutes to process a thousand or so
links, the production of "status" information is crucial. In fact, this
status information is "sent in pieces" -- with some sort of output
being sent to the client every few seconds. Not only does this help
keep the client from giving up, it also prevents "server inactive"
timeouts.

In fact, it's this "may take several minutes to finish" aspect of
CHEKLINK that makes it very difficult to distribute a pure CGI-BIN
version of CHEKLINK -- most CGI-BIN implementations do NOT allow for
"sending results as they become available", and one cannot count on
lengthy (i.e., more than a few minutes) inactive-timeouts.

Although two-part documents are the more elegant solution, with certain
browsers some very annoying "over refresh" behavior occurs (i.e., every
time you "back up" to the results, CHEKLINK is reinvoked). As a
workaround, the "two document" strategy can be used, which will result
in almost the same display as a two-part document (client pull is used
to automatically replace the "status" document with the "results"
document). The drawback is the requirement for semi-permanent storage
of the results file on your server's disk -- you may need to monitor
disk space if you allow CHEKLINK to be extensively used in two-document
mode.

-------------------
III.b. CHEKLINK, CHEKLNK2, and CHEKINDX parameters

BACK_1 : modifiers.
BACK_2
  BACK_1 and BACK_2 are used to set a BGCOLOR (or BACKGROUND) for the
  "two parts" of CheckLink's output. Note that if you are using
  CheckLink in single-part mode (i.e., if you are using an older web
  browser, or if you set the MULTI_USE option to 0), BACK_2 is ignored.
  Examples:
      BACK_1='bgcolor="#668a78"'
      BACK_2='bgcolor="#8888dd" background="CL.GIF"'
  Note: BACK_1 (BACK_2) is ignored if INTRO_1A (INTRO_1B) is set to a
        non-null value.

CHEKLINK_HTM : URL pointing to CHEKLINK.HTM
  CHEKLINK_HTM should contain a URL (usually, a relative URL) that
  points to the CHEKLINK.HTM file shipped with CheckLink. This variable
  is used to add a "generate another web-tree" option to the output
  file. Thus, neglecting to properly set CHEKLINK_HTM will have minimal
  deleterious effects.
  Example:
      CHEKLINK_HTM = '/CHEKLINK.HTM'

CHECK_ROBOT : Check (or suppress checking of) ROBOTS.TXT.
  If CHECK_ROBOT=1, then check the "Starter-URL" site for a /robots.txt
  file, and use it to control the extent of the search. Proper
  netiquette dictates that when checking a stranger's site, you make
  sure you have set CHECK_ROBOT=1.
  Note: the contents of a ROBOTS.TXT file are added to the special
        "site-specific" EXCLUSION_LIST -- it only affects URLs on the
        "Starter-URL" site.
  Example:
      CHECK_ROBOT=1

DOUBLE_CHECK: Since servers can be momentarily busy, it's often wise to
              "double check" busy servers.
      DOUBLE_CHECK=0 : do NOT double check
      DOUBLE_CHECK=1 : double check "inaccessible servers"
      DOUBLE_CHECK=2 : double check "inaccessible servers" AND
                       "missing resources"
  Double checking will occur after all links have been examined (thus
  giving the "not available" server a chance to become available).
  When double checking, GET queries are used (instead of HEAD queries).
  However, HTML documents retrieved via a double check will NOT be
  "recursively processed", even if they otherwise would have been (that
  is, had they not required this double check).

GET_QUERY: As part of mapping a web-tree, CheckLink will query servers
           for basic information on URLs.
  These queries are best done with HEAD requests. Unfortunately, there
  are a number of older servers that do not properly respond to HEAD
  requests. If you find that CheckLink is identifying many URLs as
  unavailable (even though your browser can get to them readily), it
  may be due to their host server's failure to recognize these HEAD
  requests.
  As a workaround, you can use short GET requests instead of HEAD
  requests. This method is engaged by setting GET_QUERY=1.
  Example:
      GET_QUERY=0
  Note: This GET_QUERY=1 method is not highly recommended -- it's
        slower, and somewhat "ruder" (connections are purposely broken,
        which tends to add garbage to the visited server's log file).
        Instead, we recommend setting DOUBLE_CHECK=1.

LINKFILE_DIR: directory to store "linkage" files in.
  Linkage files contain "link" information on all the URLs discovered
  during CheckLink's recursive mapping of a "web tree". In particular,
  the LINKFILE option (see section IV) specifies a filename, which will
  then be stored in the LINKFILE_DIR.
  By default, LINKFILE_DIR will be your OS/2 TEMP drive.
  Example:
      LINKFILE_DIR='D:\GOSERVE\CHKLNKS'
  Note: in addition to storing LINKFILEs, the LINKFILE_DIR is also used
        to store "RESULTS" files.

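
The interaction between GET_QUERY and DOUBLE_CHECK described above can
be pictured as a two-pass check. The following is only a rough sketch,
written in Python for illustration; it is not CheckLink's own code
(CheckLink is a set of OS/2 .CMD files). The names check_links and
probe, and the three status labels, are hypothetical; probe stands in
for whatever actually sends the HEAD or GET request.

```python
# Illustrative sketch only -- not CheckLink's implementation.
# The status labels below are hypothetical names for the outcomes
# that the DOUBLE_CHECK description refers to.
OK = "ok"
NO_SERVER = "inaccessible server"
MISSING = "missing resource"


def check_links(urls, probe, double_check=1, get_query=0):
    """Return {url: status}, re-checking failures per double_check.

    probe(url, method) is a caller-supplied function (an assumption of
    this sketch) that returns OK, NO_SERVER, or MISSING.
    """
    first_method = "GET" if get_query else "HEAD"

    # First pass: query every link once.
    status = {url: probe(url, first_method) for url in urls}

    # Decide which first-pass failures deserve a second look.
    if double_check == 0:
        retry_on = set()
    elif double_check == 1:
        retry_on = {NO_SERVER}
    else:
        retry_on = {NO_SERVER, MISSING}

    # The second pass starts only after all links have been examined,
    # which gives a momentarily busy server time to recover. GET is
    # used here regardless of get_query.
    for url in urls:
        if status[url] in retry_on:
            status[url] = probe(url, "GET")
    return status


if __name__ == "__main__":
    def fake_probe(url, method):
        # Pretend /busy rejects HEAD but answers a GET.
        if url == "/busy":
            return OK if method == "GET" else NO_SERVER
        if url == "/gone":
            return MISSING
        return OK

    print(check_links(["/a", "/busy", "/gone"], fake_probe, double_check=1))
```

With double_check=1 only the "inaccessible server" result is retried,
so a server that mishandles HEAD is rescued by the GET retry without
paying the cost of GET_QUERY=1 on every link.
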
MAXATONCE: maximum number of "query" threads
  Specifies the maximum number of threads to use when checking for the
  existence (and mimetype) of a link (using HEAD requests). Increasing
  this number may increase throughput, but it may subject the target
  server(s) to excessive loads.
  Example:
      MAXATONCE=6

MAXATONCE_GET: maximum number of "read" threads.
  Specifies the maximum number of threads to use when retrieving the
  contents of a URL (using GET requests). Increasing this number may
  increase throughput, but it may subject the target server(s) to
  excessive loads.
  Example:
      MAXATONCE_GET=2

MAXAGE: Kill a query if it's old
  Specifies number of seconds to wait on a query (a HEAD request). You
  may need to increase this time span if sites are far away or
  otherwise slow. However, increasing MAXAGE will increase the time
  that CheckLink waits on "hung" sites.
  Example:
      MAXAGE=30

MAXAGE2: Kill a read if it's old
  Specifies number of seconds to wait on a read (a GET request). You
  may need to increase this time span if sites are far away or
  otherwise slow. However, increasing MAXAGE2 will increase the time
  that CheckLink waits on "hung" sites.
  Example:
      MAXAGE2=60

PROXY_SERVER: Specify a proxy server to route requests through
  The proxy server to send http requests through. Use an IP name or
  numeric address, with optional port. If you are NOT using a proxy
  server, set this to 0.
  Examples:
      PROXY_SERVER='voxy.mycompany.com:8080'
      PROXY_SERVER=0

ROW_COLOR1  : Used to set the row colors in the results tables
ROW_COLOR2
ROW_COLOR1A
ROW_COLOR2A
  ROW_COLOR1 and ROW_COLOR2 set the colors of the odd and even rows
  (respectively) of tables used to display the results of checking IMG
  links.
  ROW_COLOR1A and ROW_COLOR2A set the colors of the odd and even rows
  (respectively) of tables used to display the results of checking
  Anchor links.
  Examples:
      ROW_COLOR1='bgcolor="#bbcc66"'
      ROW_COLOR2='bgcolor="#aaccdd"'
      ROW_COLOR1A='bgcolor="#bbaa44"'
      ROW_COLOR2A='bgcolor="#aaccdd"'

REMOVE_SCRIPT: Remove