Getting Started
Everything you need to know to begin working with the URL Categorization Database.
What You Receive
- CSV file(s) containing the complete database
- Access to this documentation
- Email support for integration questions
File Format
CSV, UTF-8 encoded, comma-delimited. The first row contains column headers. Refer to the Data Dictionary for complete field definitions.
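Assuming the column names from the Data Dictionary (they also appear in the import schemas below), a minimal read loop with Python's stdlib csv module looks like this; the sample row is hypothetical:

```python
import csv
import io

# Hypothetical sample in the documented 13-column layout; a real file is
# read the same way with open(path, encoding="utf-8", newline="").
sample = (
    "domain,iab_v2_tier1,iab_v2_tier2,iab_v2_tier3,iab_v2_tier4,"
    "iab_v3_tier1,iab_v3_tier2,iab_v3_tier3,iab_v3_tier4,"
    "web_filtering_category,personas,open_page_rank,country\n"
    'example.com,Technology & Computing,,,,'
    'Technology & Computing,,,,'
    'Information Technology,"developers,students",5.2,United States\n'
)

with io.StringIO(sample) as f:
    reader = csv.DictReader(f)  # first row supplies the column headers
    for row in reader:
        # Quoted fields (e.g. personas) may contain commas; csv handles them.
        print(row["domain"], row["web_filtering_category"])
```

Note that a CSV-aware parser is required: the personas column is quoted and can contain embedded commas, so a naive split on `,` will misalign fields.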
File Size Expectations by Tier
| Tier | Approximate Uncompressed Size |
|---|---|
| 5M domains | ~1.5 GB |
| 20M domains | ~6 GB |
| 30M domains | ~9 GB |
| 60M domains | ~18 GB |
| 102M domains | ~30 GB |
Integration Patterns
Choose the integration pattern that best fits your architecture and performance requirements.
Pattern 1: Direct Database Import
Best for: Application backends, content management systems, analytics platforms
PostgreSQL
CREATE TABLE url_categories (
domain VARCHAR(255) PRIMARY KEY,
iab_v2_tier1 VARCHAR(255),
iab_v2_tier2 VARCHAR(255),
iab_v2_tier3 VARCHAR(255),
iab_v2_tier4 VARCHAR(255),
iab_v3_tier1 VARCHAR(255),
iab_v3_tier2 VARCHAR(255),
iab_v3_tier3 VARCHAR(255),
iab_v3_tier4 VARCHAR(255),
web_filtering_category VARCHAR(100) NOT NULL,
personas TEXT,
open_page_rank DECIMAL(4,1),
country VARCHAR(100)
);
COPY url_categories FROM '/path/to/database.csv'
WITH (FORMAT csv, HEADER true, ENCODING 'UTF8');
-- No separate index needed on domain: the PRIMARY KEY constraint already provides one.
CREATE INDEX idx_web_filter ON url_categories (web_filtering_category);
MySQL
CREATE TABLE url_categories (
domain VARCHAR(255) PRIMARY KEY,
iab_v2_tier1 VARCHAR(255),
iab_v2_tier2 VARCHAR(255),
iab_v2_tier3 VARCHAR(255),
iab_v2_tier4 VARCHAR(255),
iab_v3_tier1 VARCHAR(255),
iab_v3_tier2 VARCHAR(255),
iab_v3_tier3 VARCHAR(255),
iab_v3_tier4 VARCHAR(255),
web_filtering_category VARCHAR(100) NOT NULL,
personas TEXT,
open_page_rank DECIMAL(4,1),
country VARCHAR(100)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- Server-side LOAD DATA requires the FILE privilege and a path permitted by
-- secure_file_priv; use LOAD DATA LOCAL INFILE for client-side files.
LOAD DATA INFILE '/path/to/database.csv'
INTO TABLE url_categories
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
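Either import then serves point lookups through an ordinary indexed query. As a self-contained illustration, here is the same pattern using Python's stdlib sqlite3 as a stand-in for Postgres/MySQL, with hypothetical rows and an abbreviated column list:

```python
import sqlite3

# In-memory SQLite stand-in for the tables above; the lookup pattern is
# identical once the real CSV is imported into Postgres or MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE url_categories (
        domain TEXT PRIMARY KEY,
        web_filtering_category TEXT NOT NULL,
        iab_v2_tier1 TEXT
    )
""")
# Hypothetical rows standing in for the imported database.
conn.executemany(
    "INSERT INTO url_categories VALUES (?, ?, ?)",
    [("example.com", "Information Technology", "Technology & Computing"),
     ("news.example.org", "News & Media", "News")],
)

def lookup(domain):
    # Point lookup via the implicit PRIMARY KEY index.
    row = conn.execute(
        "SELECT web_filtering_category, iab_v2_tier1 "
        "FROM url_categories WHERE domain = ?", (domain,)
    ).fetchone()
    return row  # None when the domain is uncategorized

print(lookup("example.com"))
```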
Pattern 2: Memory-Mapped Flat File
Best for: Network appliances, firewalls, proxies, DNS filters — anywhere sub-microsecond lookups matter
Steps:
- Sort the CSV by domain column
- Use memory-mapped file I/O (mmap) to map the sorted file into your process's address space
- Implement binary search for O(log n) lookups
Typical performance: nanosecond-scale lookups on modern hardware. For a 20M-domain database, a binary search needs at most ~25 comparisons (log₂ 20,000,000 ≈ 24.3).
Memory footprint: Approximately the file size. The OS page cache handles the rest — frequently accessed pages stay in RAM automatically.
#include <sys/mman.h>
#include <sys/stat.h>   /* fstat */
#include <fcntl.h>
#include <stddef.h>
// Memory-map the sorted database file (error handling omitted for brevity)
int fd = open("sorted_database.csv", O_RDONLY);
struct stat sb;
fstat(fd, &sb);
const char *data = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Binary search for domain lookup
const char *lookup_domain(const char *data, size_t size, const char *domain) {
    // Binary search over newline-delimited, domain-sorted lines;
    // implementation depends on your line format.
    // Returns a pointer to the matching line, or NULL if not found.
    return NULL; /* placeholder */
}
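The same sorted-file binary search can be sketched end to end in Python with the stdlib mmap module. The two-column lines here are illustrative, not the full 13-column format; only the first field is ever compared:

```python
import mmap
import os
import tempfile

# Build a tiny sorted "database"; each line is domain,category.
rows = sorted(["a.com,Advertising", "example.com,Information Technology",
               "zzz.org,News & Media"])
path = os.path.join(tempfile.mkdtemp(), "sorted_database.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(rows) + "\n")

def lookup(mm, domain):
    """Binary search over newline-delimited lines sorted by domain."""
    key = domain.encode()
    lo, hi = 0, len(mm)
    while lo < hi:
        mid = (lo + hi) // 2
        start = mm.rfind(b"\n", 0, mid) + 1   # start of line containing mid
        end = mm.find(b"\n", start)
        if end == -1:
            end = len(mm)
        line_key = mm[start:end].split(b",", 1)[0]
        if line_key == key:
            return mm[start:end].decode()
        if line_key < key:
            lo = end + 1      # continue in the upper half
        else:
            hi = start        # continue in the lower half
    return None

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(lookup(mm, "example.com"))   # the matching line
    print(lookup(mm, "missing.net"))   # None
```

Because `lo` and `hi` always land on line boundaries, each probe reads exactly one line, and the OS page cache keeps hot pages resident between lookups.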
Pattern 3: Redis / Key-Value Store
Best for: Web applications, API services, microservices architectures
# Load domains into Redis using redis-cli pipe mode
# ($1=domain, $10=web_filtering_category, $2=iab_v2_tier1)
# Note: a plain comma split breaks on quoted fields containing commas
# (e.g. personas); pre-process such rows or use a CSV-aware tool if needed.
awk -F',' 'NR>1 {print "SET \""$1"\" \""$10"|"$2"\""}' database.csv | redis-cli --pipe
import redis

r = redis.Redis()

# Lookup a domain
result = r.get("example.com")
if result:
    category, iab = result.decode().split("|")
    print(f"Category: {category}, IAB: {iab}")
The 5M domain database requires approximately 2–3 GB of Redis memory. For 20M+ domains, ensure sufficient RAM or use Redis Cluster.
Pattern 4: C++ SDK (Available on Request)
Best for: ultra-low-latency, zero-allocation lookups
For the most demanding performance requirements, we offer a compiled C++ SDK with:
- Hierarchical domain resolution with compiled binary lookup tables
- Automatic CDN subdomain handling (e.g., r2---sn-xxxxx.googlevideo.com resolves to YouTube)
- Multi-level TLD support (.co.uk, .com.au, etc.)
- Sub-500ns lookup latency
- Thread-safe, zero-allocation lookups after initialization
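The hierarchical resolution idea, stripping leading labels until a match is found while respecting multi-level TLDs, can be sketched in Python. The TLD set and category table here are tiny hypothetical subsets for illustration, not the SDK's actual data, and the SDK's special CDN-to-brand mapping (googlevideo → YouTube) is not reproduced:

```python
# Illustrative subset of multi-level TLDs; the real SDK ships complete tables.
MULTI_LEVEL_TLDS = {"co.uk", "com.au", "co.jp"}

# Hypothetical category table.
CATEGORIES = {
    "googlevideo.com": "Streaming Media",
    "bbc.co.uk": "News & Media",
}

def registrable_suffix_len(labels):
    """Number of trailing labels forming the TLD part (1 or 2 here)."""
    if len(labels) >= 2 and ".".join(labels[-2:]) in MULTI_LEVEL_TLDS:
        return 2
    return 1

def resolve(hostname):
    """Walk from the full hostname down to the registrable domain."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix = registrable_suffix_len(labels)
    # Longest candidate first; stop at the registrable domain (suffix + 1 labels).
    for i in range(len(labels) - suffix):
        candidate = ".".join(labels[i:])
        if candidate in CATEGORIES:
            return candidate, CATEGORIES[candidate]
    return None

print(resolve("r2---sn-xxxxx.googlevideo.com"))
print(resolve("news.bbc.co.uk"))
```

The walk never tries the bare TLD itself, so `.com` or `.co.uk` alone can never match.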
Contact us at [email protected] for SDK access and documentation.
Handling Uncategorized Domains
When a URL is not found in the database, your application needs a fallback strategy. Choose the approach that fits your security posture.
- Block by default (fail closed): the safest approach. Uncategorized domains are blocked. Suitable for high-security environments. Note: this causes false positives for newly registered domains.
- Allow by default (fail open): uncategorized domains are allowed through. Suitable for less restrictive environments; combine with other security layers (malware scanning, reputation checks).
- Real-time API fallback: query our real-time classification API for domains not in the database. Adds ~50ms latency per uncategorized lookup. Contact us for API access.
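The three strategies can be sketched as a single dispatch function. Here `db` stands in for whichever lookup backend you chose (SQL, mmap, Redis), and `classify_api` is a stub for the real-time API, both hypothetical:

```python
# Hypothetical in-memory backend standing in for the real lookup store.
db = {"example.com": "Information Technology"}

def classify_api(domain):
    # Stub for the real-time classification API (~50 ms per call).
    return "Unknown"

def lookup(domain, policy="block"):
    """Return (category, allowed) under the configured fallback policy."""
    category = db.get(domain)
    if category is not None:
        return category, True          # found: allow with its category
    if policy == "block":
        return None, False             # fail closed
    if policy == "allow":
        return None, True              # fail open
    if policy == "api":
        return classify_api(domain), True
    raise ValueError(f"unknown policy: {policy}")

print(lookup("example.com"))
print(lookup("new-site.io", policy="block"))
```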
The CrUX-based databases cover virtually all domains that real users visit, ensuring comprehensive coverage of active browsing traffic.
Update Process
Keep your data current with our regular update cycle.
Update Schedule
- Quarterly updates delivered as full replacement CSV files
- Monthly updates available on Professional and Enterprise tiers
- Recommended approach: double-buffer (load new file in background, swap pointer, release old file)
- Zero downtime during updates
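The double-buffer swap above can be sketched in Python. In CPython, rebinding a name is atomic, so readers always see either the complete old table or the complete new one, never a partially loaded state; the row data here is hypothetical:

```python
import threading

# Current "front buffer"; hypothetical starting contents.
_table = {"example.com": "Information Technology"}
_lock = threading.Lock()  # serializes writers only; readers never block

def lookup(domain):
    return _table.get(domain)  # reads whichever buffer is current

def apply_update(new_csv_rows):
    # Build the replacement table off to the side (the "back buffer")...
    new_table = {domain: category for domain, category in new_csv_rows}
    global _table
    with _lock:
        _table = new_table  # ...then swap the reference in one atomic step
    # The old dict is garbage-collected once the last reader drops it.

apply_update([("example.com", "Information Technology"),
              ("fresh.example", "News & Media")])
print(lookup("fresh.example"))
```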
Changelog
A changelog is provided with each update showing:
- New domains added — freshly categorized domains entering the database
- Domains removed — expired, parked, or delisted domains
- Category changes — reclassified domains with updated categories
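Although updates ship as full replacement files, the three changelog entry types map naturally onto table operations. A sketch assuming a hypothetical (action, domain, category) tuple format; the actual changelog file format is defined by your delivery, not by this example:

```python
# Hypothetical existing table and changelog entries.
table = {"old.example": "News & Media", "stale.example": "Shopping"}

changelog = [
    ("add", "fresh.example", "Information Technology"),  # new domain
    ("remove", "stale.example", None),                   # delisted domain
    ("change", "old.example", "Business"),               # reclassified domain
]

for action, domain, category in changelog:
    if action in ("add", "change"):
        table[domain] = category
    elif action == "remove":
        table.pop(domain, None)

print(sorted(table.items()))
```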