URL Database File Format Description

Here we document the textual file format used to store the URL database.

Overall Database

The database consists of a directory containing multi-document YAML text files. Line endings should ideally be native to the host operating system, and files are Unicode text encoded in UTF-8.

Each text file contains information about URLs specific to one domain (or CNAMEd set of equivalent domains). For the domain example.com, the name of the file should be example.com.yaml.

Domain File

Each domain file is a multi-document YAML text file. The overall textual structure of the file looks like:

---
<YAML document 1>
---
<YAML document 2>
---
<etc>
---
<final YAML document>

The first YAML document in the domain file is a set of metadata about this domain. The remaining YAML documents are individual records defining URLs to be tracked. To ease tracking with version control systems, the “record” documents are sorted in the overall file by their _path key. If you want to add a record to a domain file, the ideal workflow is to write a new, temporary file with the record inserted in the appropriate position, and then use an atomic rename to replace the existing file with your updated temporary file.

Also to ease version control, each YAML document in the domain file should be written with its dictionary keys sorted, list items sorted when appropriate, etc.

The minimal example domain file is:

---
---
_path: /
content-type: text/html

Domain Metadata Document

The domain metadata YAML document is a dictionary at its top level. It can contain the following keys.

case-sensitive-paths

A boolean, defaulting to true. If false, the webserver for this domain is case-insensitive in how it handles URL paths: example.com/hello and example.com/HeLlo are equivalent. For such servers, the capitalization of URL paths should be normalized in the database.

cnames

A sorted list of aliases to the domain in question, not including the “primary” domain as defined by the filename of the domain file. The equivalence relation here is as in a DNS CNAME: every URL associated with the primary domain should function identically to a URL associated with a cname alias as well.

For example, if the file example.com.yaml has the following content:

---
cnames:
- www.example.com
---
_path: /
content-type: text/html

Then the URLs http://example.com/ and http://www.example.com/ should both function equivalently.

https

A boolean, defaulting to false. If true, the webserver for this domain is expected to support HTTPS access as well as unencrypted HTTP.

URL Record Document

Each URL record YAML document is a dictionary at its top level.

Path and Content Type

At a minimum, each record must contain a key called _path that provides an absolute URL path defining the URL in question. It must also contain a key called content-type that records the HTTP Content-Type of the returned document. A minimal record document might simply consist of:

_path: /robots.txt
content-type: text/plain

The path may contain path parameters (semicolon-delimited) and query parameters (ampersand-delimited) but not a fragment specifier (octothorpe-delimited).

A URL record containing just a path indicates that an HTTP GET request to the URL defined by combining the domain name in question and the path in question should return an HTTP 2xx or 3xx status code. The document example.com.yaml containing:

---
---
_path: /index.html?foo=bar
content-type: text/html

Indicates that the URL http://example.com/index.html?foo=bar should be successfully accessible in this way.

Content types may contain parameters, e.g.:

---
_path: /about/
content-type: text/html; charset=utf-8

Categories

Records can be assigned arbitrary textual categories. These are stored in the YAML as a sorted list of strings under the key categories. Example:

---
_path: /favicon.ico
categories:
- frontend
- graphics
content-type: image/x-icon

Categories can be assigned upon URL registration by using the --category flag to wwturldb add. The flag can be specified more than once, or not at all.

Static Content

If the record contains the key content-length, that indicates that the content returned by the server for the URL request (following redirects) should have exactly the length specified by the record. The value in the record is an integer number of bytes.

If the record contains the key content-sha256, the indicate that the SHA256 digest of the the content returned by the server should match the value specified by the record. The value in the record should be a lowercase hexadecimal expression of the digest.

Example:

_path: /m51.txt
content-length: 1650240
content-sha256: fd3589aa8a72beb48939de884e3ee5324b510c145003f375c77cd4ecb1a79672
content-type: text/ascii

These features are aimed at declaring “static content” that should not change over time. When adding a URL, giving the --static flag to wwturldb add causes these keys to be recorded in the database file.