We architected an edge caching layer to eliminate cold starts

Posted by skeptrune 1 day ago

Comments

Comment by infogulch 22 hours ago

Automatic version detection, revalidation, prewarming... caching seems so complicated these days. Forgive me for starting a sentence with "why don't we just"... but why don't we just use the hash of the object as the cache key and be done with it? You get integrity validation as a bonus to boot.

    <link rel="stylesheet" href="main.css?hash=sha384-5rcfZgbOPW7..." integrity="sha384-5rcfZgbOPW7..."/>

    ETag: "sha384-5rcfZgbOPW7..."
    Cache-Control: max-age=31536000, immutable

Comment by rob74 21 hours ago

Sure, but where's the fun in that? Then you wouldn't be able to write "we architected a caching layer"! To their credit, at least this isn't the actual title of the article, but it still left me wondering if an actual architect (you know, the kind of architect that designs buildings) would say "I architected this"?

Comment by skeptrune 21 hours ago

Because you want the ability to invalidate the cache for an entire site at the same time. So you would still need some map between domain and hash.

Comment by infogulch 6 hours ago

You don't need to invalidate anything if the cache is keyed on the hash of the served objects. To put it another way, a hash-keyed cache results in perfectly precise, instant, distributed cache invalidation. Read the code in my comment again.
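
A minimal sketch of the hash-keyed scheme as a Cloudflare Worker (illustrative, not from the article; assumes @cloudflare/workers-types for caches.default):

    // Hypothetical Worker: the cache is keyed on the content hash alone.
    // A new deploy references new hashes, so nothing ever needs purging.
    export default {
      async fetch(request: Request): Promise<Response> {
        const url = new URL(request.url);
        const hash = url.searchParams.get("hash");
        if (!hash) return fetch(request); // unhashed request: pass through uncached

        // The hash *is* the identity; path and domain are irrelevant to the key.
        const cacheKey = new Request(`https://cache.invalid/${hash}`);
        const cached = await caches.default.match(cacheKey);
        if (cached) return cached;

        const response = await fetch(request);
        if (response.ok) {
          const copy = new Response(response.clone().body, response);
          copy.headers.set("Cache-Control", "public, max-age=31536000, immutable");
          await caches.default.put(cacheKey, copy);
        }
        return response;
      },
    };

One could additionally hash the fetched body with crypto.subtle.digest and compare it to the key, which is the integrity bonus mentioned above.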

Comment by pyrolistical 21 hours ago

I just don’t get it. Their last paragraph describes how they changed their dynamic site to be static. So then why do you need workers at all? Just deploy to a CDN.

How do you do version updates? Add a content hash to all files except the root index.html.

Cache everything forever, except for index.html.

To deploy a new version, upload all files, making sure index.html is last.

Since all files are unique, the old version continues to be served.

No cache invalidation required, since all files have unique paths, except index.html, which was never cached.

You have to ensure you have proper content hashes for absolutely everything. Images, CSS, JS. Everything.
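
A sketch of the header split this scheme implies (a hypothetical helper; the hash convention and lifetimes are illustrative):

    // Pick Cache-Control for a file in a hashed-asset deploy.
    // Hashed filenames look like app.3f9c2b7a.js; index.html is the only
    // mutable entry point. Conventions vary by bundler.
    const HASHED = /\.[0-9a-f]{8,}\./;

    function cacheControlFor(path: string): string {
      if (path.endsWith("index.html")) {
        // Never cached: every visit re-fetches the entry point,
        // which in turn points at hashed assets.
        return "no-cache";
      }
      if (HASHED.test(path)) {
        // Content-addressed: safe to cache forever; the old version
        // keeps working mid-deploy.
        return "public, max-age=31536000, immutable";
      }
      // Anything unhashed that isn't index.html is a deploy bug under this scheme.
      throw new Error(`unhashed asset would poison the cache: ${path}`);
    }

    console.log(cacheControlFor("assets/app.3f9c2b7a.js")); // long-lived, immutable
    console.log(cacheControlFor("index.html"));             // no-cache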

Comment by Borealid 20 hours ago

What happens in the event of a hash collision?

Comment by pyrolistical 20 hours ago

Have the deploy fail and don't update index.html; users stay on the current version.

For example, with CloudFront and S3, you can use If-None-Match when uploading to ensure the deploy fails on a conflict.
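
Roughly, with the AWS SDK (a sketch; bucket handling and error details are illustrative):

    // Hypothetical deploy step: refuse to overwrite an existing object.
    // S3 conditional writes: If-None-Match: * makes the PUT fail with
    // 412 Precondition Failed if the key already exists.
    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({ region: "us-east-1" }); // region illustrative

    async function putIfAbsent(bucket: string, key: string, body: Buffer): Promise<void> {
      try {
        await s3.send(new PutObjectCommand({
          Bucket: bucket,
          Key: key,
          Body: body,
          IfNoneMatch: "*", // fail rather than clobber a colliding key
        }));
      } catch (err: any) {
        if (err?.$metadata?.httpStatusCode === 412) {
          // Abort the whole deploy here, before index.html is ever touched.
          throw new Error(`object already exists at ${key}; aborting deploy`);
        }
        throw err;
      }
    }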

Comment by immibis 13 hours ago

You win a million dollar prize in cryptography.

Comment by owenthejumper 1 day ago

The invalidation queue is interesting, but building a custom cache key manually? Even Cloudflare now supports Cache-Tags.
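
For reference, tag-based purging looks roughly like this (a sketch; Cache-Tag purging is a Cloudflare Enterprise feature, and the zone and tag names here are made up). The origin labels responses with a Cache-Tag header, and one API call invalidates everything carrying that tag across the edge:

    // Hypothetical: purge every cached object tagged with `tag`.
    async function purgeSiteTag(zoneId: string, apiToken: string, tag: string): Promise<void> {
      const res = await fetch(`https://api.cloudflare.com/client/v4/zones/${zoneId}/purge_cache`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ tags: [tag] }), // e.g. one tag per customer site
      });
      if (!res.ok) throw new Error(`purge failed: ${res.status}`);
    }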

Comment by skeptrune 23 hours ago

We chose to do a custom cache key to avoid modifying the origin host NextJS app as much as possible. If we had more confidence in modifying the host then I agree cache-tags would have been better.
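
A sketch of the custom-cache-key idea under that constraint (hypothetical Worker code, not the article's; assumes a KV binding mapping hostname to a version counter). Bumping the version changes every key for that site at once, giving the whole-site invalidation mentioned above without touching the origin:

    interface Env {
      SITE_VERSIONS: KVNamespace; // hypothetical binding: hostname -> version counter
    }

    export default {
      async fetch(request: Request, env: Env): Promise<Response> {
        const url = new URL(request.url);
        const version = (await env.SITE_VERSIONS.get(url.hostname)) ?? "0";

        // The per-site version is baked into the key, so bumping it in KV
        // orphans every old entry instantly; nothing is purged explicitly.
        const cacheKey = new Request(
          `https://cache.invalid/${url.hostname}/v${version}${url.pathname}${url.search}`,
        );

        const cached = await caches.default.match(cacheKey);
        if (cached) return cached;

        const response = await fetch(request); // origin NextJS app, unmodified
        if (response.ok) {
          await caches.default.put(cacheKey, new Response(response.clone().body, response));
        }
        return response;
      },
    };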

Comment by 0x3f 1 day ago

Sometimes I feel like work and needless infra complexity grows perfectly to match headcount and nominally available resources.

Comment by amichal 1 day ago

I feel the same: 72 million monthly page views is about 83 pages per second, even if concentrated in a single timezone's 8-hour day (72e6 / (30 d * 8 h/d * 3600 s/h) ≈ 83/s). Even with today's heavyweight pages we are talking well under 1000 req/s. Assuming they are not super image/asset heavy, I would expect this to be comfortably served by a couple of reasonable old-school nginx servers [1]. If each page were a full megabyte of uncached content we are under 10 Gbit/s. Probably under 1.

The build logic to decide which things to rebuild is probably the interesting bit, of course, but we don't need all these services... </grey-beard-rant>

[1] https://openbenchmarking.org/test/pts/nginx&eval=c18b8feaeca...

edit: to be less ranty, they are more or less building static sites out of their Next.js codebase, but updated on demand etc., which is indeed interesting; none of this needs Cloudflare/hyperscaler tech though.

Not sure how many customers/sites they have. Perhaps they don't want to spend CPU regenerating all sites on every deployment? They do describe a content-driven pre-warmer, but I'm still unclear why this couldn't be a content-driven static site generator running on some build machine.

Comment by 0x3f 1 day ago

The thing is you can still stick a CDN in front of your old school servers and just use a 'stale-while-revalidate' header to get exactly the effect described here.
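
For reference, the whole trick on the origin side is one header (a minimal sketch; the lifetimes are made up):

    // Any CDN in front of this origin can serve stale pages while it
    // revalidates in the background.
    import { createServer } from "node:http";

    createServer((_req, res) => {
      res.setHeader(
        "Cache-Control",
        // s-maxage: how long the CDN copy is fresh; stale-while-revalidate:
        // how long it may keep serving the stale copy while refetching.
        "public, s-maxage=60, stale-while-revalidate=86400",
      );
      res.end("<html>...</html>");
    }).listen(8080);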

Comment by SkiFire13 14 hours ago

We do this, but if you're redeploying fast enough there's a chance that a user loads a cached old page (or performs a client-side navigation to an old page) and makes a request for a URL that's no longer served by the origin and isn't cached by the CDN.

Comment by cloudflare728 1 day ago

I have done this with Next.js. Next.js doesn't support this header, or at least I don't know how to set it.

I already had HAProxy set up, so I added a stale-while-revalidate-compatible header from HAProxy. Cloudflare handles the rest.

Edit: I am not using Vercel. Self-hosted using Docker on EC2.

Comment by amichal 21 hours ago

Yeah, as a salty greybeard I tried to tell our FE tech lead to just render the proper HTTP Cache-Control headers in the Next.js site we recently built. He tried, then linked me to https://nextjs.org/docs/app/guides/caching and various versions of their docs on when you can and cannot set Cache-Control headers (e.g. https://nextjs.org/docs/app/api-reference/config/next-config...), and I got several hours of headache before calling it a problem for another day. That site is not high-traffic enough to care, but this is not the first time I've gotten the "not the Next.js way" talk and was not happy.

It obviously can be done, but it clearly is not the intended solution, which really bothers me.

Comment by 0x3f 1 day ago

Well, part of the Vercel game is to lock you into their platform and extract $$$, but as I recall you can spec out headers in the NextJS config? And possibly on Cloudflare itself via cache rules?
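
The config hook does exist, for what it's worth. A sketch of a next.config.ts (route pattern and lifetimes illustrative; whether Next.js actually honors it for a given page depends on how that page is rendered, which is the trap described below):

    // Setting Cache-Control via the Next.js headers() config option.
    import type { NextConfig } from "next";

    const nextConfig: NextConfig = {
      async headers() {
        return [
          {
            source: "/:path*",
            headers: [
              {
                key: "Cache-Control",
                value: "public, s-maxage=60, stale-while-revalidate=86400",
              },
            ],
          },
        ];
      },
    };

    export default nextConfig;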

Comment by cloudflare728 1 day ago

I am self-hosting using Docker. The Next.js config option to change the header didn't work for me. I had cache rules in Cloudflare, but the no-cache header Next.js sets on pages doesn't allow Cloudflare to apply stale-while-revalidate.

Now that I have the proper header added by HAProxy, the Cloudflare cache rules for stale-while-revalidate work.

If anyone can reach Cloudflare: please let us forcefully use stale-while-revalidate even when the upstream server says otherwise.

Comment by amichal 1 day ago

this too...

Comment by skeptrune 1 day ago

Stale-while-revalidate as implemented in the post was easier for us and required fewer resources than migrating from our dynamic site architecture to static. Ideally we would have migrated to fully static sites, but the engineering effort required to make that happen wasn't in scope.

Comment by 9rx 1 day ago

Something I noticed a long time ago is that Vercel makes everything they touch 10 times harder than it needs to be.

I have come to conclude it is that way because they focus on optimizing for a demo case that presents well to non-technical stakeholders. Doing one particular thing that looks good at a glance gets the buy-in, and then those who bought in never have to deal with the consequences of the decision once it is time to build something other than the demo.

Comment by 0x3f 23 hours ago

I'm no fan of Vercel, but it's kind of the symptom of a wider pattern, right? I see crazy architecture astronaut setups in so many places. It's true non-technical stakeholders can cause problems but I often see it pushed from inside the tech org too. I'm thinking it's some combination of resume-driven development, misunderstanding of 'scalability'/when it's needed, and intra-org working-together problems where it's easier to just make a new service and assert your dominion over it.

Comment by skeptrune 1 day ago

I blame this more on NextJS than Vercel, but agree in spirit. Their architecture creates a pit of failure: you're encouraged to fall into a fully dynamic pattern, which is a huge trap.

However, it's probably more inexperience than anything. Nobody senior was around to tell our founders that they should go for an SSG architecture when they started /shrug. It's mostly worked out anyways though haha.

Comment by immibis 13 hours ago

If true, it's one of the only things preventing total economic collapse due to lack of jobs.

Comment by 0x3f 4 hours ago

I would suggest that UBI in fact already exists, just in a subset of tech jobs where you have to engage in a certain kind of theater to get it. It's only by construction though that losing these jobs would be a problem. We have pointless busywork (and a ton of other problems) because housing is a failed market, essentially.

Comment by samdoesnothing 22 hours ago

A lot of people are criticizing this for unnecessary complexity, but it's a little more complicated than that. I actually think it makes sense given where they are right now. The complexity stems from Vercel and Next.js: had they used different tech, say Cloudflare directly, and architected their own system designed to handle rapidly changing static content, none of this would have been necessary. So I guess it depends on your definition of unnecessary complexity. It's definitely unnecessary for the problem space, but probably necessary for their existing stack.

Comment by ricardobeat 23 hours ago

2025, the world rediscovers simple static caching. You could do the same with varnish/nginx or wp-cache with 10% of the complexity. Or a CDN.

“Incremental Static Regeneration” is also one of the funniest things to come out of this tech cycle.

Comment by skeptrune 23 hours ago

I have an existential crisis about joining a company so deeply bought into NextJS dark patterns every day.