
Privacy and security by design, in practice: the choices (and the trade-offs) behind a real website

30 June 2026 · 10 min read

The principles of privacy and security by design only become useful once they turn into code. Here is how each one was applied, choice by choice, in a real website, trade-offs and legal grounds included: minimal data, cookieless analytics, a nonce-based CSP with an alarm, logs without personal data, defence at the edge.

The principles are easy to recite: minimise, keep little, lock down access, prevent injection. It is in the translation into code that it becomes clear whether they were words or choices. This article takes a real website, this one, and shows how each principle became a concrete decision. With a caveat that separates a guide from a serious reference: nothing here is "perfect and immovable". There are trade-offs, and there are defences that must be verified, not declared. So alongside each choice there is also what is given up, what is rejected outright, and what depends on context.

Collect less, and say so

The starting point is not "how do I protect the data I collect", but "which data is genuinely needed". The contact form asks for two mandatory things: an email address to reply to, and the message. No phone number, no profiling, name and company optional. A field that is not collected does not need protecting, cannot be lost in a breach, does not need keeping. Retention is decided up front: no longer than twelve months, then deletion.
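A retention rule decided up front is easiest to honour when a scheduled job enforces it. The following is a minimal Python sketch, not the site's actual code: the record shape, the `received_at` field and the function name are hypothetical, and "twelve months" is approximated as 365 days.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Assumption: twelve months is approximated as 365 days.
RETENTION = timedelta(days=365)


def purge_expired(messages: List[Dict], now: datetime) -> List[Dict]:
    """Return only the messages that are still inside the retention window.

    Each message is a dict with a timezone-aware 'received_at' datetime
    (a hypothetical record shape, used here for illustration only).
    """
    cutoff = now - RETENTION
    return [m for m in messages if m["received_at"] > cutoff]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stored = [
        {"id": 1, "received_at": now - timedelta(days=400)},
        {"id": 2, "received_at": now - timedelta(days=10)},
    ]
    print([m["id"] for m in purge_expired(stored, now)])
```

Run daily from a scheduler, a job of this kind turns the retention period from a statement in the privacy notice into behaviour of the system.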

Minimisation does not exempt anyone from transparency. Below the form sit the consent box and an unambiguous link to the privacy notice, which states the purposes, the legal basis and the retention periods. The obligation to inform, set out in Article 13 of the GDPR (Regulation (EU) 2016/679), still applies, even when the data is sparse.

Then there is spam, and here a clear distinction is needed. The instinctive reaction is Google's reCAPTCHA: it has to be rejected outright, because it sets cookies and tracks the user, which destroys precisely the privacy being built. A defence that defeats its own purpose is not a defence. The privacy-preserving alternative is a combination: a honeypot field (an invisible box that humans never see and bots fill in: if it comes back populated, it is a bot, and the submission is discarded), a per-IP rate limit, and filtering at the edge, in the reverse proxy.
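The honeypot and the per-IP rate limit fit in a few lines. What follows is a minimal Python sketch under stated assumptions, not the site's implementation: the hidden field name `website`, the limit of three submissions per ten minutes and the in-memory store are all hypothetical choices made for illustration.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

HONEYPOT_FIELD = "website"  # hypothetical name of the hidden form field
MAX_SUBMISSIONS = 3         # assumed limit per window
WINDOW_SECONDS = 600.0      # assumed window length (ten minutes)

# ip -> timestamps of accepted submissions (in memory, single process only)
_recent: Dict[str, Deque[float]] = defaultdict(deque)


def is_bot(form: Dict[str, str]) -> bool:
    """A populated honeypot means something filled in the invisible field."""
    return bool(form.get(HONEYPOT_FIELD, "").strip())


def allow(ip: str, now: Optional[float] = None) -> bool:
    """Sliding window: at most MAX_SUBMISSIONS per WINDOW_SECONDS per IP."""
    now = time.monotonic() if now is None else now
    stamps = _recent[ip]
    # Drop timestamps that have fallen out of the window.
    while stamps and now - stamps[0] >= WINDOW_SECONDS:
        stamps.popleft()
    if len(stamps) >= MAX_SUBMISSIONS:
        return False
    stamps.append(now)
    return True


def accept_submission(form: Dict[str, str], ip: str,
                      now: Optional[float] = None) -> bool:
    """Discard honeypot hits first, then apply the per-IP rate limit."""
    return not is_bot(form) and allow(ip, now)
```

The IP is used here only in memory and only for the duration of the window, so the check does not add anything to what is written to disk.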

For a low-traffic form this combination is enough. Should spam ever become a serious problem, there is a further step, still privacy-respecting: proof-of-work systems, such as Altcha, which make the browser perform a small computation before submission. This makes automated mass-submission expensive for the bot, with no third parties and without tracking anyone. The contrast with reCAPTCHA is sharp: reCAPTCHA is rejected on principle, because it violates privacy in every case; proof-of-work is not rejected at all, it simply is not needed today. That is a context-dependent choice, not a rejection on principle.
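To make the idea of proof-of-work concrete, here is a generic hashcash-style sketch in Python. It is not Altcha's protocol or API: the challenge format, the function names and the difficulty value are assumptions chosen only to show why solving is costly and verifying is cheap.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: a fresh random challenge for each form render."""
    return secrets.token_hex(16)


def _meets_difficulty(challenge: str, counter: int, bits: int) -> bool:
    """True if SHA-256(challenge:counter) starts with `bits` zero bits."""
    digest = hashlib.sha256(f"{challenge}:{counter}".encode("ascii")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0


def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force the smallest counter that satisfies the target.

    On average this takes about 2**bits hash evaluations.
    """
    counter = 0
    while not _meets_difficulty(challenge, counter, bits):
        counter += 1
    return counter


def verify(challenge: str, counter: int, bits: int) -> bool:
    """Server side: a single hash evaluation checks the submitted solution."""
    return _meets_difficulty(challenge, counter, bits)


if __name__ == "__main__":
    challenge = make_challenge()
    answer = solve(challenge, bits=16)  # assumed difficulty
    print(verify(challenge, answer, bits=16))
```

The asymmetry is the point: a human submits one form and pays the cost once, while a bot sending thousands pays it thousands of times. A real deployment would also need to expire challenges and reject reused ones, which this sketch leaves out.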

Cookieless analytics, and the truth about IP addresses

Most sites measure traffic with a tool that installs cookies and loads third-party scripts. Those two ingredients, on their own, force a cookie banner and send visitors' data off to an external provider. Here the statistics are derived from the reverse proxy's access logs, analysed with an open-source tool. No script in the browser, no cookie, no third party: nothing to consent to, so no banner.

Two honest admissions are needed, though, because a technical reader will spot both anyway.

First, the trade-off. Without a session identifier, the "unique visitors" count is an estimate, not a surgical measurement. And a proxy's logs are full of bots and scanners. A less precise metric is accepted in exchange for total privacy and lightness. It is a stated choice, not a hidden limitation.

Second, the most important objection, worth raising up front: "an IP address is personal data; calling it 'anonymised' while writing it to disk in the clear does not add up". It is half right, and it is the point to set straight.

That an IP address can be personal data is settled: the Court of Justice of the European Union held in Breyer (Case C-582/14, 19 October 2016) that a dynamic IP address is personal data for a website operator that has the legal means to have the visitor identified through the access provider. The ruling was handed down under the earlier Directive 95/46/EC, but the principle has carried over to the GDPR.

The next objection, "then truncate it at source, in the reverse proxy, before writing it to disk", cannot be accepted here, and not out of laziness. This site's abuse defence bans malicious IPs (scanners, brute-force attempts): to ban an IP, it needs to know the real IP. Truncating it at source leaves security blind, or ends up banning entire blocks and hitting innocent users. On a site without an IP-based defence, and with a purely statistical purpose, truncating at source would in fact be the right thing to do: it is a matter of context.

And keeping the IP for that purpose is not an abuse; it is a purpose the law recognises explicitly. Recital 49 of the GDPR, an interpretive recital of the regulation and not an article, states that processing personal data, to the extent strictly necessary and proportionate, to ensure network and information security, preventing unauthorised access and stopping attacks such as "denial of service", constitutes a legitimate interest of the controller. That is exactly what banning an IP that is scanning or attacking the site does. The operative legal basis is Article 6(1)(f) of the same Regulation; Recital 49 is its interpretive confirmation.

The right solution, then, is not blind truncation, it is separation of purposes. The full IP lives in the security log, for as long as is strictly necessary (here, thirty days). The statistical report, which is a different purpose, never shows an individual's IP: it masks it (last octet zeroed) and aggregates.

So the IP does not disappear entirely, because security needs it; but it is confined (to the security log), time-limited (thirty days), justified (Article 6(1)(f) of the GDPR, read in the light of Recital 49 of the same regulation) and never exposed as individual data in the statistics. This is privacy by design done properly: not "zero data at any cost", but the minimum data necessary, for the minimum time, on the right legal basis.
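The masking step on the statistics side can be sketched with Python's standard `ipaddress` module. This is an illustration, not the analysis tool's own code: zeroing the last octet of an IPv4 address is what the text describes, while truncating IPv6 to a /48 is an assumption added here, and the function names are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Iterable

IPV4_PREFIX = 24  # last octet zeroed, as described in the text
IPV6_PREFIX = 48  # assumption: the article does not say how IPv6 is masked


def mask_ip(ip: str) -> str:
    """Return the address with its host part zeroed."""
    addr = ipaddress.ip_address(ip)
    prefix = IPV4_PREFIX if addr.version == 4 else IPV6_PREFIX
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def aggregate(ips: Iterable[str]) -> Counter:
    """Count requests per masked network; no individual IP survives."""
    return Counter(mask_ip(ip) for ip in ips)


if __name__ == "__main__":
    print(mask_ip("203.0.113.57"))
    print(aggregate(["203.0.113.57", "203.0.113.9", "198.51.100.4"]))
```

The security log keeps the full address for its thirty days; a function like this sits only on the reporting path, so the two purposes never share the same view of the data.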

The armoured door, and the alarm

One of the commonest ways to attack a site is to inject a script that should not be there. The Content Security Policy is the armoured door: it tells the browser which scripts it may run and blocks the rest. Here the CSP uses a per-request nonce (a random value, regenerated for every response, that each script tag must carry before the browser will run it) with 'strict-dynamic', the strictest rule: only the nonce counts, not the origin, and a script admitted by the nonce may in turn load the scripts it needs. 'self' remains as a fallback for older browsers that do not understand CSP Level 3; modern ones ignore it and trust the nonce alone.

Content-Security-Policy:
  default-src 'self';
  script-src 'self' 'nonce-<random-per-request>' 'strict-dynamic';
  object-src 'none'; base-uri 'none'; frame-ancestors 'none';
  report-to csp-endpoint
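Generating the nonce and assembling that header can be sketched as follows. This is a minimal Python illustration, not the site's code: the function names are hypothetical, and 16 random bytes is an assumed (commonly used) nonce size.

```python
import base64
import html
import secrets


def new_nonce() -> str:
    """A fresh, unpredictable value; one must be generated per response."""
    return base64.b64encode(secrets.token_bytes(16)).decode("ascii")


def csp_header(nonce: str) -> str:
    """Build the policy shown above with the nonce filled in."""
    directives = [
        "default-src 'self'",
        f"script-src 'self' 'nonce-{nonce}' 'strict-dynamic'",
        "object-src 'none'",
        "base-uri 'none'",
        "frame-ancestors 'none'",
        "report-to csp-endpoint",
    ]
    return "; ".join(directives)


def script_tag(nonce: str, src: str) -> str:
    """Every legitimate script tag carries the same nonce as the header."""
    return f'<script nonce="{nonce}" src="{html.escape(src, quote=True)}"></script>'


if __name__ == "__main__":
    nonce = new_nonce()
    print("Content-Security-Policy:", csp_header(nonce))
    print(script_tag(nonce, "/static/app.js"))  # hypothetical script path
```

The nonce only protects anything if it is never reused and never cached along with the page: a static or predictable value lets an injected script carry it too.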

But a "blind" armoured door is half the job. If someone attempts an injection, or if a legitimate script breaks after an update, how is it noticed? This is why the CSP here does not merely prevent, it reports: the report-to and report-uri directives tell the browser to send a report every time it blocks something, to an endpoint that records it. From "a locked door in the dark" to "a locked door with an alarm".

A common misconception needs heading off here: these reports are not for banning the attacker. They come from the victim's browser (it is the victim's browser that blocks and reports), not the attacker's, and they are noisy (extensions, antivirus) and forgeable. They are a diagnostic tool, not a means of capture: banning happens elsewhere, where the real IP of whoever is attacking is visible. And because the endpoint that collects the reports is public and unauthenticated, it has been hardened accordingly (a body-size limit, a rate limit, sanitisation of the logged values against log injection). How all of this is verified is the point a little further on.
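The hardening of the report collector can be sketched as a single handler function. This is a Python illustration under assumptions, not the site's endpoint: the size limit, the status codes, the log-line format and the choice to parse only the legacy `csp-report` JSON shape are all hypothetical, and the rate limit is left out.

```python
import json
from typing import Optional, Tuple

MAX_BODY_BYTES = 8192  # assumed limit for a single report


def sanitise(value: object, max_len: int = 200) -> str:
    """Replace newlines and other non-printable characters with spaces, then truncate.

    This keeps an attacker-controlled value from starting a forged log line.
    """
    text = str(value)
    cleaned = "".join(ch if ch.isprintable() else " " for ch in text)
    return cleaned[:max_len]


def handle_report(method: str, content_length: int,
                  body: bytes) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, log line or None) for one incoming report."""
    if method != "POST":
        return 405, None
    # Refuse on the declared length first, so oversized bodies need not be read.
    if content_length > MAX_BODY_BYTES or len(body) > MAX_BODY_BYTES:
        return 413, None
    try:
        report = json.loads(body)
    except ValueError:  # covers malformed JSON and invalid encodings
        return 400, None
    payload = report.get("csp-report") if isinstance(report, dict) else None
    if not isinstance(payload, dict):
        return 400, None
    blocked = sanitise(payload.get("blocked-uri", ""))
    directive = sanitise(payload.get("violated-directive", ""))
    return 204, f"csp-report blocked={blocked} directive={directive}"
```

Nothing from the report is trusted: it is treated as noisy, forgeable input whose only use is to tell the developer that something was blocked.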

Logs without personal data, and how to debug all the same

Logs are there to keep a site running and to defend it, but they are also where personal data quietly piles up. Here the content of a message, or any personal datum, is never written to the logs; and the technical logs are kept for thirty days, then rotated away.

The right question, for anyone who has run systems, is how to find one user's request among thousands when their email was never logged. The answer is a correlation id: for every error a short, non-secret code is generated, shown to the user and written in the log next to the technical reason. The user quotes the code, the developer finds the request. Full diagnosis, zero personal data in the logs. It is the essence of privacy by design applied to observability.
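The correlation id pattern is short enough to show in full. The following is a minimal Python sketch, not the site's code: the logger name, the eight-character length, the alphabet (chosen to avoid look-alike characters) and the wording of the user message are assumptions.

```python
import logging
import secrets

logger = logging.getLogger("app")  # hypothetical logger name

# Assumption: an alphabet without look-alike characters (no 0/O, 1/I/L).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def new_correlation_id(length: int = 8) -> str:
    """A short random code; it identifies a request, not a person."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def report_error(reason: str) -> str:
    """Log the technical reason next to the code and return the user message.

    `reason` must itself be free of personal data (no email, no message body).
    """
    cid = new_correlation_id()
    logger.error("cid=%s reason=%s", cid, reason)
    return f"Something went wrong. If you contact us, please quote code {cid}."


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(report_error("mail relay timeout"))
```

The code is not a secret and carries no meaning on its own: it is only a key that lets the developer find one log line among thousands.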

Defence at the edge, and secrets: knowing where to stop

Before the application even comes into play, the reverse proxy filters: the trap paths that scanners probe wholesale get a flat block, unnecessary HTTP methods are refused, and there is a limit on request-body size to cut off oversized payloads. Less surface, less noise.
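On the real site these rules live in the reverse proxy's configuration, which the article does not show. The decision logic can still be modelled in a few lines of Python; the trap paths, the allowed methods, the size limit and the status codes below are hypothetical examples, not the site's actual values.

```python
from typing import Optional

# Hypothetical examples of paths that scanners probe on any site.
TRAP_PREFIXES = ("/wp-login.php", "/.env", "/.git", "/phpmyadmin")
ALLOWED_METHODS = frozenset({"GET", "HEAD", "POST"})  # assumed set
MAX_BODY_BYTES = 64 * 1024                            # assumed limit


def edge_decision(method: str, path: str, content_length: int) -> Optional[int]:
    """Return the status to answer with at the edge, or None to forward.

    The checks run in the order the text lists them: trap paths,
    then HTTP method, then request-body size.
    """
    if path.lower().startswith(TRAP_PREFIXES):
        return 403
    if method.upper() not in ALLOWED_METHODS:
        return 405
    if content_length > MAX_BODY_BYTES:
        return 413
    return None


if __name__ == "__main__":
    print(edge_decision("GET", "/.env", 0))
    print(edge_decision("POST", "/contact", 512))
```

Every request refused here never reaches the application, which is what keeps both the attack surface and the application logs small.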

Secrets never live in the code or in the container image: they exist only as environment variables injected at start-up. Here an experienced reader rightly objects that environment variables are not ideal: they can end up in a crash dump or be read by anyone with access to the container. True, and the answer is proportionality, not cargo-culting. Standing up a secret manager on a single server is a complexity, and a new attack surface, out of all proportion to the gain: it makes sense at multi-service scale, in a team, with orchestration and audit. The intermediate step, once more services are added, is to mount secrets as ephemeral files rather than populating the process environment. Saying "here is where it is right to stop, and here is where one would go further" is more mature than reaching for the biggest tool regardless.
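The step from environment variables to mounted files can be made without touching the calling code if the lookup is wrapped in one function. This is a minimal Python sketch under assumptions: the `/run/secrets` directory follows a common container convention but is not stated in the article, and the function and secret names are hypothetical.

```python
import os
from pathlib import Path

DEFAULT_SECRETS_DIR = "/run/secrets"  # assumption: a common container convention


def read_secret(name: str, secrets_dir: str = DEFAULT_SECRETS_DIR) -> str:
    """Prefer a mounted file, fall back to the environment variable.

    The file is looked up under the lower-cased secret name.
    """
    file_path = Path(secrets_dir) / name.lower()
    if file_path.is_file():
        return file_path.read_text(encoding="utf-8").strip()
    value = os.environ.get(name)
    if value is None:
        # The error names the secret but never echoes a value.
        raise RuntimeError(f"secret {name} is not configured")
    return value


if __name__ == "__main__":
    os.environ.setdefault("SMTP_PASSWORD", "example-only")  # hypothetical secret
    print(len(read_secret("SMTP_PASSWORD")))
```

With a wrapper of this kind, a single server can keep using environment variables today, and the move to ephemeral files later is a deployment change rather than a code change.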

Two points that often go missing. The form uses the framework's Server Actions, which perform the CSRF check natively (an origin comparison): no need to reinvent it. And there is no need for Subresource Integrity, because no third-party resources are loaded: the fonts and everything else are hosted on the domain. The best problem is the one removed at the root.
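For readers who have never looked inside such a check, the idea of an origin comparison can be sketched in Python. This is a conceptual illustration only, not the framework's implementation: the function name is hypothetical, and a real check may also have to account for forwarded-host headers behind a proxy.

```python
from typing import Optional
from urllib.parse import urlsplit


def same_origin(origin_header: Optional[str], host_header: str) -> bool:
    """Accept a state-changing request only if Origin matches Host.

    A missing Origin header, or the literal value "null", is treated
    as a mismatch in this sketch.
    """
    if not origin_header:
        return False
    origin_host = urlsplit(origin_header).netloc.lower()
    return origin_host != "" and origin_host == host_header.lower()


if __name__ == "__main__":
    print(same_origin("https://example.com", "example.com"))
    print(same_origin("https://evil.example", "example.com"))
```

A page on another site can make the visitor's browser send a form submission, but it cannot choose the Origin header the browser attaches, which is why the comparison is enough to reject the forged request.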

Defences are verified, not declared

There is a difference between writing "the endpoint is secure" and demonstrating it. The CSP report collector, being public, was attacked on purpose: a request with the wrong method gets a flat refusal, an oversized body is discarded before it is even read, an attempt to inject fake log lines is neutralised (newlines become spaces, no forged log entries), a burst of closely spaced requests is throttled. Every defence was put to the test, not merely written. That is what separates a system that is robust from one that merely looks robust.

The checklist, in short

Collect only the data that is indispensable; every extra field is one more risk.
Decide the expiry up front and automate deletion.
Measure without cookies and without third parties, owning the trade-off on precision.
Keep the IP only where it is needed (security), for the minimum time, on the right legal basis; mask it in the statistics.
A nonce-based CSP to prevent, reports to notice; never ban on the reports.
No personal data in the logs; a correlation id for debugging.
Filter at the edge, secrets out of the code; the right tool for the scale.
Verify the defences, do not declare them.

Not a cost, but a better system (and an honest one)

Building this way does not slow anything down: a site that collects less has less to protect, one without tracking is faster, a tight CSP with an alarm holds up and warns. Privacy and security by design are not a constraint to endure at the end, but the starting point that makes everything else simpler. And the mature version is not the one that declares itself perfect: it is the one that states what it accepts as a trade-off, what it rejects and why, and where it would go further. This site is that version.

For the general principles this article builds on:

Privacy by design: what it is and how to apply it
