Getting Started
Everything you need to know to begin working with the URL Categorization Database.
What You Receive
- CSV file(s) containing the complete database
- Access to this documentation
- Email support for integration questions
File Format
CSV, UTF-8 encoded, comma-delimited. The first row contains column headers. Refer to the Data Dictionary for complete field definitions.
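Assuming the column names from the Data Dictionary (they also appear in the import schemas below), a minimal read loop with Python's stdlib csv module looks like this; the sample row is hypothetical:

```python
import csv
import io

# Hypothetical sample in the documented 13-column layout; a real file is
# read the same way with open(path, encoding="utf-8", newline="").
sample = (
    "domain,iab_v2_tier1,iab_v2_tier2,iab_v2_tier3,iab_v2_tier4,"
    "iab_v3_tier1,iab_v3_tier2,iab_v3_tier3,iab_v3_tier4,"
    "web_filtering_category,personas,open_page_rank,country\n"
    'example.com,Technology & Computing,,,,'
    'Technology & Computing,,,,'
    'Information Technology,"developers,students",5.2,United States\n'
)

with io.StringIO(sample) as f:
    reader = csv.DictReader(f)  # first row supplies the column headers
    for row in reader:
        # Quoted fields (e.g. personas) may contain commas; csv handles them.
        print(row["domain"], row["web_filtering_category"])
```

Note that a CSV-aware parser is required: the personas column is quoted and can contain embedded commas, so a naive split on `,` will misalign fields.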
File Size Expectations by Tier
| Tier | Approximate Uncompressed Size |
|---|---|
| 5M domains | ~1.5 GB |
| 20M domains | ~6 GB |
| 30M domains | ~9 GB |
| 60M domains | ~18 GB |
| 102M domains | ~30 GB |
Integration Patterns
Choose the integration pattern that best fits your architecture and performance requirements.
Pattern 1: Direct Database Import
Best for: Application backends, content management systems, analytics platforms
PostgreSQL
CREATE TABLE url_categories (
domain VARCHAR(255) PRIMARY KEY,
iab_v2_tier1 VARCHAR(255),
iab_v2_tier2 VARCHAR(255),
iab_v2_tier3 VARCHAR(255),
iab_v2_tier4 VARCHAR(255),
iab_v3_tier1 VARCHAR(255),
iab_v3_tier2 VARCHAR(255),
iab_v3_tier3 VARCHAR(255),
iab_v3_tier4 VARCHAR(255),
web_filtering_category VARCHAR(100) NOT NULL,
personas TEXT,
open_page_rank DECIMAL(4,1),
country VARCHAR(100)
);
COPY url_categories FROM '/path/to/database.csv'
WITH (FORMAT csv, HEADER true, ENCODING 'UTF8');
-- No separate index needed on domain: the PRIMARY KEY constraint already provides one.
CREATE INDEX idx_web_filter ON url_categories (web_filtering_category);
MySQL
CREATE TABLE url_categories (
domain VARCHAR(255) PRIMARY KEY,
iab_v2_tier1 VARCHAR(255),
iab_v2_tier2 VARCHAR(255),
iab_v2_tier3 VARCHAR(255),
iab_v2_tier4 VARCHAR(255),
iab_v3_tier1 VARCHAR(255),
iab_v3_tier2 VARCHAR(255),
iab_v3_tier3 VARCHAR(255),
iab_v3_tier4 VARCHAR(255),
web_filtering_category VARCHAR(100) NOT NULL,
personas TEXT,
open_page_rank DECIMAL(4,1),
country VARCHAR(100)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- Server-side LOAD DATA requires the FILE privilege and a path permitted by
-- secure_file_priv; use LOAD DATA LOCAL INFILE for client-side files.
LOAD DATA INFILE '/path/to/database.csv'
INTO TABLE url_categories
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
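Either import then serves point lookups through an ordinary indexed query. As a self-contained illustration, here is the same pattern using Python's stdlib sqlite3 as a stand-in for Postgres/MySQL, with hypothetical rows and an abbreviated column list:

```python
import sqlite3

# In-memory SQLite stand-in for the tables above; the lookup pattern is
# identical once the real CSV is imported into Postgres or MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE url_categories (
        domain TEXT PRIMARY KEY,
        web_filtering_category TEXT NOT NULL,
        iab_v2_tier1 TEXT
    )
""")
# Hypothetical rows standing in for the imported database.
conn.executemany(
    "INSERT INTO url_categories VALUES (?, ?, ?)",
    [("example.com", "Information Technology", "Technology & Computing"),
     ("news.example.org", "News & Media", "News")],
)

def lookup(domain):
    # Point lookup via the implicit PRIMARY KEY index.
    row = conn.execute(
        "SELECT web_filtering_category, iab_v2_tier1 "
        "FROM url_categories WHERE domain = ?", (domain,)
    ).fetchone()
    return row  # None when the domain is uncategorized

print(lookup("example.com"))
```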
Pattern 2: Memory-Mapped Flat File
Best for: Network appliances, firewalls, proxies, DNS filters — anywhere sub-microsecond lookups matter
Steps:
- Sort the CSV by domain column
- Use memory-mapped file I/O (mmap) to map the sorted file into your process's address space
- Implement binary search for O(log n) lookups
Typical performance: nanosecond-scale lookups on modern hardware. For a 20M-domain database, a binary search needs at most ~25 comparisons (log₂ 20,000,000 ≈ 24.3).
Memory footprint: Approximately the file size. The OS page cache handles the rest — frequently accessed pages stay in RAM automatically.
#include <sys/mman.h>
#include <sys/stat.h>   /* fstat */
#include <fcntl.h>
#include <stddef.h>
// Memory-map the sorted database file (error handling omitted for brevity)
int fd = open("sorted_database.csv", O_RDONLY);
struct stat sb;
fstat(fd, &sb);
const char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Binary search for domain lookup
const char *lookup_domain(const char *data, size_t size, const char *domain) {
    // Binary search over newline-delimited, domain-sorted lines;
    // implementation depends on your line format.
    // Returns a pointer to the matching line, or NULL if not found.
    return NULL; /* placeholder */
}
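The same sorted-file binary search can be sketched end to end in Python with the stdlib mmap module. The two-column lines here are illustrative, not the full 13-column format; only the first field is ever compared:

```python
import mmap
import os
import tempfile

# Build a tiny sorted "database"; each line is domain,category.
rows = sorted(["a.com,Advertising", "example.com,Information Technology",
               "zzz.org,News & Media"])
path = os.path.join(tempfile.mkdtemp(), "sorted_database.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

def lookup(mm, domain):
    """Binary search over newline-delimited lines sorted by domain."""
    key = domain.encode()
    lo, hi = 0, len(mm)
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind(b"\n", 0, mid) + 1   # start of line containing mid
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)
        line_key = mm[start:end].split(b",", 1)[0]
        if line_key == key:
            return mm[start:end].decode()
        if line_key < key:
            lo = end + 1      # continue in the upper half
        else:
            hi = start        # continue in the lower half
    return None

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, "example.com"))   # the matching line
    print(lookup(mm, "missing.net"))   # None
```

Because `lo` and `hi` always land on line boundaries, each probe reads exactly one line, and the OS page cache keeps hot pages resident between lookups.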
Pattern 3: Redis / Key-Value Store
Best for: Web applications, API services, microservices architectures
# Load domains into Redis using redis-cli pipe mode
# ($1=domain, $10=web_filtering_category, $2=iab_v2_tier1)
# Note: a plain comma split breaks on quoted fields containing commas
# (e.g. personas); pre-process such rows or use a CSV-aware tool if needed.
awk -F',' 'NR>1 {print "SET \""$1"\" \""$10"|"$2"\""}' database.csv | redis-cli --pipe
import redis

r = redis.Redis()

# Lookup a domain
result = r.get("example.com")
if result:
    category, iab = result.decode().split("|")
    print(f"Category: {category}, IAB: {iab}")
The 5M domain database requires approximately 2–3 GB of Redis memory. For 20M+ domains, ensure sufficient RAM or use Redis Cluster.
Pattern 4: C++ SDK (Available on Request)
Best for: ultra-low-latency, zero-allocation lookups
For the most demanding performance requirements, we offer a compiled C++ SDK with:
- Hierarchical domain resolution with compiled binary lookup tables
- Automatic CDN subdomain handling (e.g., r2---sn-xxxxx.googlevideo.com resolves to YouTube)
- Multi-level TLD support (.co.uk, .com.au, etc.)
- Sub-500ns lookup latency
- Thread-safe, zero-allocation lookups after initialization
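The hierarchical resolution idea, stripping leading labels until a match is found while respecting multi-level TLDs, can be sketched in Python. The TLD set and category table here are tiny hypothetical subsets for illustration, not the SDK's actual data, and the SDK's special CDN-to-brand mapping (googlevideo → YouTube) is not reproduced:

```python
# Illustrative subset of multi-level TLDs; the real SDK ships complete tables.
MULTI_LEVEL_TLDS = {"co.uk", "com.au", "co.jp"}

# Hypothetical category table.
CATEGORIES = {
    "googlevideo.com": "Streaming Media",
    "bbc.co.uk": "News & Media",
}

def registrable_suffix_len(labels):
    """Number of trailing labels forming the TLD part (1 or 2 here)."""
    if len(labels) >= 2 and ".".join(labels[-2:]) in MULTI_LEVEL_TLDS:
        return 2
    return 1

def resolve(hostname):
    """Walk from the full hostname down to the registrable domain."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix = registrable_suffix_len(labels)
    # Longest candidate first; stop at the registrable domain (suffix + 1 labels).
    for i in range(len(labels) - suffix):
        candidate = ".".join(labels[i:])
        if candidate in CATEGORIES:
            return candidate, CATEGORIES[candidate]
    return None

print(resolve("r2---sn-xxxxx.googlevideo.com"))
print(resolve("news.bbc.co.uk"))
```

The walk never tries the bare TLD itself, so `.com` or `.co.uk` alone can never match.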
Contact us at [email protected] for SDK access and documentation.
Handling Uncategorized Domains
When a URL is not found in the database, your application needs a fallback strategy. Choose the approach that fits your security posture.
- Block by default (fail closed): the safest approach. Uncategorized domains are blocked. Suitable for high-security environments. Note: this causes false positives for newly registered domains.
- Allow by default (fail open): uncategorized domains are allowed through. Suitable for less restrictive environments; combine with other security layers (malware scanning, reputation checks).
- Real-time API fallback: query our real-time classification API for domains not in the database. Adds ~50ms latency per uncategorized lookup. Contact us for API access.
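The three strategies can be sketched as a single dispatch function. Here `db` stands in for whichever lookup backend you chose (SQL, mmap, Redis), and `classify_api` is a stub for the real-time API, both hypothetical:

```python
# Hypothetical in-memory backend standing in for the real lookup store.
db = {"example.com": "Information Technology"}

def classify_api(domain):
    # Stub for the real-time classification API (~50 ms per call).
    return "Unknown"

def lookup(domain, policy="block"):
    """Return (category, allowed) under the configured fallback policy."""
    category = db.get(domain)
    if category is not None:
        return category, True          # found: allow with its category
    if policy == "block":
        return None, False             # fail closed
    if policy == "allow":
        return None, True              # fail open
    if policy == "api":
        return classify_api(domain), True
    raise ValueError(f"unknown policy: {policy}")

print(lookup("example.com"))
print(lookup("new-site.io", policy="block"))
```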
The CrUX-based databases cover virtually all domains that real users visit, ensuring comprehensive coverage of active browsing traffic.
Update Process
Keep your data current with our regular update cycle.
Update Schedule
- Quarterly updates delivered as full replacement CSV files
- Monthly updates available on Professional and Enterprise tiers
- Recommended approach: double-buffer (load new file in background, swap pointer, release old file)
- Zero downtime during updates
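The double-buffer swap above can be sketched in Python. In CPython, rebinding a name is atomic, so readers always see either the complete old table or the complete new one, never a partially loaded state; the row data here is hypothetical:

```python
import threading

# Current "front buffer"; hypothetical starting contents.
_table = {"example.com": "Information Technology"}
_lock = threading.Lock()  # serializes writers only; readers never block

def lookup(domain):
    return _table.get(domain)  # reads whichever buffer is current

def apply_update(new_csv_rows):
    # Build the replacement table off to the side (the "back buffer")...
    new_table = {domain: category for domain, category in new_csv_rows}
    global _table
    with _lock:
        _table = new_table  # ...then swap the reference in one atomic step
    # The old dict is garbage-collected once the last reader drops it.

apply_update([("example.com", "Information Technology"),
              ("fresh.example", "News & Media")])
print(lookup("fresh.example"))
```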
Changelog
A changelog is provided with each update showing:
- New domains added — freshly categorized domains entering the database
- Domains removed — expired, parked, or delisted domains
- Category changes — reclassified domains with updated categories
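Although updates ship as full replacement files, the three changelog entry types map naturally onto table operations. A sketch assuming a hypothetical (action, domain, category) tuple format; the actual changelog file format is defined by your delivery, not by this example:

```python
# Hypothetical existing table and changelog entries.
table = {"old.example": "News & Media", "stale.example": "Shopping"}

changelog = [
    ("add", "fresh.example", "Information Technology"),  # new domain
    ("remove", "stale.example", None),                   # delisted domain
    ("change", "old.example", "Business"),               # reclassified domain
]

for action, domain, category in changelog:
    if action in ("add", "change"):
        table[domain] = category
    elif action == "remove":
        table.pop(domain, None)

print(sorted(table.items()))
```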