Integration Guide

Step-by-step instructions for importing and querying the URL Categorization Database in your applications, infrastructure, and analytics platforms.

Getting Started

Everything you need to know to begin working with the URL Categorization Database.

What You Receive

File Format

CSV, UTF-8 encoded, comma-delimited. The first row contains column headers. Refer to the Data Dictionary for complete field definitions.

File Size Expectations by Tier

Tier           Approximate Uncompressed Size
5M domains     ~1.5 GB
20M domains    ~6 GB
30M domains    ~9 GB
60M domains    ~18 GB
102M domains   ~30 GB

Recommended Workflow

  1. Download
  2. Decompress
  3. Import
  4. Index on domain
  5. Query
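For smaller tiers, the import-index-query steps above can be prototyped directly in Python with an in-memory dictionary (a minimal sketch; the sample rows and column subset below are illustrative, not the full schema):

```python
import csv
import io

# Illustrative sample standing in for the decompressed delivery file
SAMPLE = """domain,iab_v2_tier1,web_filtering_category
example.com,Technology & Computing,Information Technology
news.example.org,News and Politics,News
"""

# Import: build an in-memory index keyed on domain.
# Fine for the smaller tiers; use a real database for 20M+ rows.
index = {}
for row in csv.DictReader(io.StringIO(SAMPLE)):
    index[row["domain"]] = row

# Query
match = index.get("example.com")
category = match["web_filtering_category"] if match else None
```

In production, replace the inline sample with `open("database.csv", newline="", encoding="utf-8")` and the same `csv.DictReader` loop.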

Integration Patterns

Choose the integration pattern that best fits your architecture and performance requirements.

Pattern 1: Direct Database Import

Relational Database

Best for: Application backends, content management systems, analytics platforms

PostgreSQL

PostgreSQL
CREATE TABLE url_categories (
    domain VARCHAR(255) PRIMARY KEY,
    iab_v2_tier1 VARCHAR(255),
    iab_v2_tier2 VARCHAR(255),
    iab_v2_tier3 VARCHAR(255),
    iab_v2_tier4 VARCHAR(255),
    iab_v3_tier1 VARCHAR(255),
    iab_v3_tier2 VARCHAR(255),
    iab_v3_tier3 VARCHAR(255),
    iab_v3_tier4 VARCHAR(255),
    web_filtering_category VARCHAR(100) NOT NULL,
    personas TEXT,
    open_page_rank DECIMAL(4,1),
    country VARCHAR(100)
);

COPY url_categories FROM '/path/to/database.csv'
WITH (FORMAT csv, HEADER true, ENCODING 'UTF8');
-- If the server process cannot read the file path, use psql's client-side \copy instead.

-- domain is already indexed via the PRIMARY KEY; add a secondary index for category filtering
CREATE INDEX idx_web_filter ON url_categories (web_filtering_category);

MySQL

MySQL
CREATE TABLE url_categories (
    domain VARCHAR(255) PRIMARY KEY,
    iab_v2_tier1 VARCHAR(255),
    iab_v2_tier2 VARCHAR(255),
    iab_v2_tier3 VARCHAR(255),
    iab_v2_tier4 VARCHAR(255),
    iab_v3_tier1 VARCHAR(255),
    iab_v3_tier2 VARCHAR(255),
    iab_v3_tier3 VARCHAR(255),
    iab_v3_tier4 VARCHAR(255),
    web_filtering_category VARCHAR(100) NOT NULL,
    personas TEXT,
    open_page_rank DECIMAL(4,1),
    country VARCHAR(100)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Server-side loads are restricted by secure_file_priv; use LOAD DATA LOCAL INFILE
-- from a client if the server cannot read the path directly.
LOAD DATA INFILE '/path/to/database.csv'
INTO TABLE url_categories
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
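The same import-and-index pattern can be rehearsed end to end with SQLite before committing to a server database (a minimal sketch; the sample rows and reduced column set are illustrative):

```python
import csv
import io
import sqlite3

# Illustrative subset of the delivered CSV
SAMPLE = """domain,web_filtering_category,iab_v2_tier1
example.com,Information Technology,Technology & Computing
shop.example.net,Shopping,Shopping
"""

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE url_categories (
        domain TEXT PRIMARY KEY,
        web_filtering_category TEXT NOT NULL,
        iab_v2_tier1 TEXT
    )
""")

# Import: executemany streams rows straight from the CSV reader
reader = csv.DictReader(io.StringIO(SAMPLE))
conn.executemany(
    "INSERT INTO url_categories VALUES (:domain, :web_filtering_category, :iab_v2_tier1)",
    reader,
)
conn.execute("CREATE INDEX idx_web_filter ON url_categories (web_filtering_category)")

# Query by domain (served by the PRIMARY KEY index)
row = conn.execute(
    "SELECT web_filtering_category FROM url_categories WHERE domain = ?",
    ("example.com",),
).fetchone()
```

The column names mirror the PostgreSQL/MySQL schemas above, so queries written against this prototype carry over largely unchanged.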

Pattern 2: Memory-Mapped Flat File

High Performance

Best for: Network appliances, firewalls, proxies, DNS filters — anywhere sub-microsecond lookups matter

Steps:

  1. Sort the CSV by domain column
  2. Use memory-mapped file I/O (mmap) to map the sorted file into your process's address space
  3. Implement binary search for O(log n) lookups

Typical performance: Sub-microsecond lookups on modern hardware. For a 20M domain database, a lookup requires ~24 comparisons.
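The comparison count follows directly from the binary-search bound of about log2(n) steps; a quick arithmetic check for the 20M tier:

```python
import math

# Binary search over n sorted records needs about log2(n) comparisons;
# for the 20M domain tier that is log2(20,000,000) ~= 24
n = 20_000_000
comparisons = round(math.log2(n))
```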

Memory footprint: Approximately the file size. The OS page cache handles the rest — frequently accessed pages stay in RAM automatically.

C
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <stddef.h>

// Memory-map the sorted database file (error handling omitted for brevity)
int fd = open("sorted_database.csv", O_RDONLY);
struct stat sb;
fstat(fd, &sb);
char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

// Binary search over newline-delimited lines sorted by domain; the domain is
// the first comma-separated field. Returns the matching line or NULL.
const char *lookup_domain(const char *data, size_t size, const char *domain) {
    size_t lo = 0, hi = size, dlen = strlen(domain);
    while (lo < hi) {
        size_t pos = lo + (hi - lo) / 2;
        while (pos > lo && data[pos - 1] != '\n') pos--;   // rewind to line start
        int cmp = strncmp(domain, data + pos, dlen);
        if (cmp == 0 && data[pos + dlen] == ',')
            return data + pos;                             // exact match
        if (cmp <= 0) {                                    // target sorts earlier
            hi = pos;
        } else {                                           // target sorts later
            while (pos < size && data[pos] != '\n') pos++; // skip this line
            lo = pos + 1;
        }
    }
    return NULL;
}

Pattern 3: Redis / Key-Value Store

Web Applications

Best for: Web applications, API services, microservices architectures

Bash
# Load domains into Redis using redis-cli pipe mode.
# $1 = domain, $10 = web_filtering_category, $2 = iab_v2_tier1.
# Note: this simple awk split assumes the extracted fields contain no embedded
# commas; use a CSV-aware tool for fully quoted data.
awk -F',' 'NR>1 {print "SET \""$1"\" \""$10"|"$2"\""}' database.csv | redis-cli --pipe
Python
import redis
r = redis.Redis()

# Lookup a domain
result = r.get("example.com")
if result:
    category, iab = result.decode().split("|")
    print(f"Category: {category}, IAB: {iab}")

The 5M domain database requires approximately 2–3 GB of Redis memory. For 20M+ domains, ensure sufficient RAM or use Redis Cluster.

Pattern 4: C++ SDK (Available on Request)

Maximum Performance

Best for: The highest performance requirements with zero-allocation lookups

For these workloads, we offer a compiled C++ SDK with zero-allocation lookups.

Contact us at [email protected] for SDK access and documentation.

Handling Uncategorized Domains

When a URL is not found in the database, your application needs a fallback strategy. Choose the approach that fits your security posture.

Block by Default

Safest approach. Uncategorized domains are blocked. Suitable for high-security environments.

Note: this can cause false positives for newly registered domains that have not yet been categorized.

Allow by Default

Uncategorized domains are allowed through. Suitable for less restrictive environments. Combine with other security layers (malware scanning, reputation checks).

Real-Time API Fallback

Query our real-time classification API for domains not in the database. Adds ~50ms latency per uncategorized lookup. Contact us for API access.

The CrUX-based databases cover virtually all domains that real users visit, ensuring comprehensive coverage of active browsing traffic.
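The fallback strategies above can be combined into a single policy function (a minimal illustration; the category names, `index` lookup table, and block/allow policy values are assumptions for the sketch, not part of the database schema):

```python
BLOCK, ALLOW = "block", "allow"

def decide(domain, index, default_policy=BLOCK,
           blocked_categories=frozenset({"Malware"})):
    """Return 'allow' or 'block', falling back to default_policy for unknowns."""
    category = index.get(domain)
    if category is None:
        # Uncategorized domain: apply the configured default.
        # A real-time classification API could be queried here instead.
        return default_policy
    return BLOCK if category in blocked_categories else ALLOW

# Hypothetical lookup table; in production this is the imported database
index = {"example.com": "Information Technology", "bad.example": "Malware"}
```

Switching between block-by-default and allow-by-default is then a one-argument configuration change rather than a code change.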

Update Process

Keep your data current with our regular update cycle.

Update Schedule

Changelog

A changelog is provided with each update.

Last Updated: March 2026