WebSockets cost us $1M on our AWS bill
https://www.recall.ai/post/how-websockets-cost-us-1m-on-our-aws-bill
By tosh
gwbas1c | 2 comments | 6 hours ago
---
I have a similar story: Where I work, we had a cluster of VMs that were always high CPU and a bit of a problem. We had a lot of fire drills where we'd have to bump up the size of the cluster, abort in-progress operations, or some combination of both.
Because this cluster of VMs was doing batch processing that the founder believed should be CPU-intensive, everyone just assumed that load grew with customer size, and that this was just an annoyance we could get to after we shipped one more feature.
But at one point the bean counters pointed out that we spent disproportionately more on cloud than a normal business did. After one round of combining different VM clusters (that really didn't need to be separate servers), I decided I could take some time to hook this very CPU-intensive cluster up to a profiler.
I thought I was in for a 1-2 week project and would have to follow a few worms. Instead, the CPU load came from constantly loading an entire table, one we never deleted from, into the application's process. The table held transient data that should only have lasted a few hours at most.
I quickly deleted almost a decade's worth of obsolete data from the table. After about 15 minutes, CPU usage for this cluster dropped to almost nothing. The next day we made the VM cluster a fraction of its size, and in the next release, we got rid of the cluster and merged the functionality into another cluster.
I also made a pull request that added a simple filter to the query so it only loaded the last 3 days of data, and then introduced a background operation to clean out the table periodically.
alsetmusic | 1 comment | 5 hours ago
gwbas1c | 1 comment | 5 hours ago
I was really disappointed when my wife couldn't get the night off from work when the company took everyone out to a fancy steak house.
chgs | 5 comments | 4 hours ago
ChadNauseam | 0 comments | 50 minutes ago
bagels | 0 comments | 4 hours ago
ponty_rick | 0 comments | 4 hours ago
Cyphase | 1 comment | 4 hours ago
tempest_ | 0 comments | 3 hours ago
_hyn3 | 1 comment | 4 hours ago
You seem to be assuming that a $200 meal was the only compensation the person received, and they weren't just getting a nice meal as a little something extra on top of getting paid for doing their job competently and efficiently.
But that's the kind of deal I make when I take a job: I do the work (pretty well most of the time), and I get paid. If I stop doing the work, I stop getting paid. If they stop paying, I stop doing the work. (And bonus, literally, if I get a perk once in a while like a free steak dinner that I wasn't expecting)
It doesn't have to be more complicated than that.
meiraleal | 2 comments | 3 hours ago
Sohcahtoa82 | 1 comment | 2 hours ago
fn-mote | 0 comments | 2 minutes ago
Of course the truth is more complicated than the sound bite, but still...
groby_b | 0 comments | 2 hours ago
That's independent of pay scale.
Granted, if you pay way below expectations, you'll lose the professionals over time. But if you pay lavishly no matter what, you get the 2021/2022 big tech hiring cycle instead. Neither one is a great outcome.
antisthenes | 1 comment | 4 hours ago
dd82 | 1 comment | 4 hours ago
Quekid5 | 2 comments | 3 hours ago
That's what quadratic means.
jon_richards | 0 comments | 11 minutes ago
sgerenser | 0 comments | 2 hours ago
wiml | 4 comments | 6 hours ago
It's weird to be living in a world where this is a surprise but here we are.
Nice write-up though. WebSockets has a number of nonsensical design decisions, but I wouldn't have expected this to be the one chewing up all your CPU.
arccy | 1 comment | 5 hours ago
adastra22 | 0 comments | 3 hours ago
carlhjerpe | 0 comments | 5 hours ago
handfuloflight | 0 comments | 6 hours ago
I think it's because the cost of it is so abstracted away with free streaming video all across the web. Once you take a look at the egress and ingress sides you realize how quickly it adds up.
sensanaty | 0 comments | 4 hours ago
I can easily imagine the author being in a similar boat, knowing that it isn't cheap, but then not realizing that expensive in this context truly does mean expensive until they actually started seeing the associated costs.
turtlebits | 2 comments | 7 hours ago
VWWHFSfQ | 1 comment | 7 hours ago
I doubt they would have even noticed this outrageous cost if they were running on bare-metal Xeons or Ryzen colo'd servers. You can rent real 44-core Xeon servers for like, $250/month.
So yes, it's an AWS issue.
JackSlateur | 3 comments | 7 hours ago
> You can rent real 44-core Xeon servers for like, $250/month.
Where, for instance?
Faaak | 1 comment | 7 hours ago
dilyevsky | 3 comments | 7 hours ago
dijit | 2 comments | 7 hours ago
GCP exposes their CPU models, and they still have some Haswell and Broadwell parts in service.
That's 10+ year old silicon, for those paying attention.
tsimionescu | 1 comment | 6 hours ago
dijit | 1 comment | 6 hours ago
akvadrako | 0 comments | 5 hours ago
I used to work for a company that rented lots of Hetzner boxes. Consumer-grade hardware with frequent disk failures was just what we accepted for saving a buck.
dilyevsky | 1 comment | 6 hours ago
dijit | 1 comment | 6 hours ago
dilyevsky | 1 comment | 6 hours ago
dijit | 0 comments | 6 hours ago
Anyway, depending on individual nodes to always be up for reliability is incredibly foolhardy. Things can happen, cloud isn't magic, I've had instances become unrecoverable. Though it is rare.
So, I still don't understand the point, that was not exactly relevant to what I said.
blibble | 1 comment | 6 hours ago
AWS: E5-2680 v4 (2016)
Hetzner: Ryzen 5 (2019)
dilyevsky | 1 comment | 3 hours ago
blibble | 0 comments | 2 hours ago
the AWS one is some emulated block device, no idea what it is, other than it's 20x slower
speedgoose | 0 comments | 6 hours ago
VWWHFSfQ | 1 comment | 7 hours ago
petcat | 2 comments | 7 hours ago
phonon | 1 comment | 6 hours ago
[0] https://instances.vantage.sh/aws/ec2/c8g.12xlarge?region=us-...
[1] https://portal.colocrossing.com/register/order/service/480
[2] https://browser.geekbench.com/v6/cpu/8305329
[3] https://browser.geekbench.com/processors/intel-xeon-e5-2699-...
petcat | 1 comment | 5 hours ago
That's not 48 dedicated cores, it's 48 "vCPUs". There are probably 1,000 other EC2 instances running on those cores stealing CPU cycles. You might get 4 cores of actual compute throughput. Which is what I was saying.
phonon | 1 comment | 5 hours ago
petcat | 1 comment | 4 hours ago
phonon | 1 comment | 4 hours ago
In fact, you can even get a small discount with the -flex series, if you're willing to compromise slightly. (Small discount for 100% of performance 95% of the time).
petcat | 1 comment | 3 hours ago
phonon | 0 comments | 35 minutes ago
fragmede | 1 comment | 7 hours ago
petcat | 0 comments | 6 hours ago
GauntletWizard | 0 comments | 7 hours ago
brazzy | 3 comments | 7 hours ago
turtlebits | 1 comment | 5 hours ago
If I said that "childbirth cost us $5,000 on our <hospital name> bill", you'd assume the issue is with the hospital.
Capricorn2481 | 0 comments | 3 hours ago
bigiain | 0 comments | 5 hours ago
anitil | 0 comments | 5 hours ago
trollied | 5 comments | 6 hours ago
> Even the theoretical maximum size of a TCP/IP packet, 64k, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.
Just highlights that they don't have enough technical knowledge in-house. They should spend the $1M/year savings on hiring some good devs.
hathawsh | 1 comment | 5 hours ago
Edit: I guess perhaps you're saying that they don't know all the networking configuration knobs they could exercise, and that's probably true. However, they landed on a more optimal solution that avoided networking altogether, so they no longer had any need to research network configuration. I'd say they made the right choice.
maxmcd | 0 comments | 5 hours ago
karamanolev | 1 comment | 5 hours ago
drowsspa | 1 comment | 5 hours ago
bcrl | 1 comment | 5 hours ago
More shocking to me is that anyone would attempt to run network throughput oriented software inside of Chromium. Look at what Cloudflare and Netflix do to get an idea what direction they should really be headed in.
oefrha | 0 comments | 3 hours ago
What's surprising to me is they can't access the compressed video on the wire and have to send decoded raw video. But presumably they've thought about that too.
lttlrck | 0 comments | 4 hours ago
adamrezich | 0 comments | 4 hours ago
maxmcd | 0 comments | 5 hours ago
IX-103 | 0 comments | 6 hours ago
But writing a custom ring buffer implementation is also nice, I suppose...
austin-cheney | 1 comment | 3 hours ago
So they are only halfway correct about masking. The RFC does mandate that client-to-server communication be masked, but that is only enforced by web browsers. If the client is absolutely anything else, just ignore masking. Since the RFC requires a bit to identify whether a message is masked, and that bit is in no way tied to the client/server role of the sender, there is no way to really mandate enforcement. So, just don't mask messages and nothing will break.
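For reference, a minimal sketch of how masking works on the wire (my own illustration in Rust, not anything from the article): the MASK flag is the top bit of the second frame-header byte, and masking is just an XOR with a 4-byte key, so the same operation masks and unmasks.

    // Illustration only (per RFC 6455): the MASK flag is bit 0x80 of the
    // second header byte; the payload is XORed with a 4-byte key.
    fn is_masked(second_header_byte: u8) -> bool {
        second_header_byte & 0x80 != 0
    }

    fn toggle_mask(payload: &mut [u8], key: [u8; 4]) {
        // XOR is symmetric, so this both applies and removes the mask.
        for (i, byte) in payload.iter_mut().enumerate() {
            *byte ^= key[i % 4];
        }
    }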
Fragmentation is completely unavoidable though. The RFC does allow messages to be fragmented at custom lengths in the protocol itself, and that part is avoidable. However, TLS imposes message fragmentation of its own. In some runtimes, messages sent at too high a frequency will be concatenated, which requires splitting them back apart by message length at the receiving end. Firefox sometimes sends frame headers detached from their frame bodies, which is another form of fragmentation.
You have to account for all that fragmentation from outside the protocol, and it is very slow. In my own implementation, receiving messages took just under 11x longer to process than sending them on a burst of 10 million messages, largely irrespective of message body length. Even with that slowness, my WebSocket implementation proved to be almost 8x faster than HTTP/1 in real-world full-duplex use on a large application.
simoncion | 0 comments | 51 minutes ago
If one is doing websockets on the local machine (or any other trusted network) and one has performance concerns, one should maybe consider not doing TLS.
If the websocket standard demands TLS, then I guess getting to skip that would be another benefit of not using a major-web-browser-provided implementation.
handfuloflight | 2 comments | 8 hours ago
DrammBA | 0 comments | 6 hours ago
lawrenceduk | 0 comments | 8 hours ago
Jokes aside though, some good performance sleuthing there.
marcopolo | 1 comment | 6 hours ago
The linked section of the RFC is worth the read: https://www.rfc-editor.org/rfc/rfc6455#section-10.3
moron4hire | 0 comments | 3 hours ago
The RFC has a link to a document describing the attack, but the link is broken.
pier25 | 1 comment | 6 hours ago
Was it because they didn't want to use some multicast video server?
IamLoading | 0 comments | 4 hours ago
sfink | 1 comment | 5 hours ago
The initial approach was shipping raw video over a WebSocket. I could not imagine putting something like that together and selling it. When your first computer came with 64KB in your entire machine, some of which you can't use at all and some you can't use without bank switching tricks, it's really really hard to even conceive of that architecture as a possibility. It's a testament to the power of today's hardware that it worked at all.
And yet, it did work, and it served as the basis for a successful product. They presumably made money from it. The inefficiency sounds like it didn't get in the way of developing and iterating on the rest of the product.
I can't do it. Premature optimization may be the root of all evil, but I can't work without having some sense for how much data is involved and how much moving or copying is happening to it. That sense would make me immediately reject that approach. I'd go off over-architecting something else before launching, and somebody would get impatient and want their money back.
ketzo | 0 comments | 3 hours ago
Knowing thyself is a superpower all its own; we need people to write scrappy code to validate a business idea, and we need people who look at code with disgust, throw it out, and write something 100x as efficient.
cogman10 | 4 comments | 7 hours ago
Here they have a nicely compressed stream of video data, so they take that stream and... decode it. But they aren't processing the decoded data at the source of the decode, so instead they forward that decoded data, uncompressed(!!), to a different location for processing. Surprisingly, they find out that moving uncompressed video data from one location to another is expensive. So, they compress it later (Don't worry, using a GPU!)
At so many levels this is just WTF. Why not forward the compressed video stream? Why not decompress it where you are processing it instead of in the browser? Why are you writing it out without any attempt at compression? Even if you want lossless compression, there are well-known and fast codecs like FFV1 for that purpose.
Just weird.
isoprophlex | 1 comment | 7 hours ago
As it turns out, doing something in Rust does not absolve you of the obligation to actually think about what you are doing.
dylan604 | 0 comments | 6 hours ago
bri3d | 1 comment | 5 hours ago
I'm pretty sure that feeding the browser an emulated hardware decoder (i.e., writing a VAAPI module that just copies compressed frame data out for you) would be a good semi-universal solution to this, since I don't think most video chat solutions use DRM like Widevine, but it's not as universal as dumping the framebuffer output off of a browser session.
They could also, of course, one-off reverse engineer each meeting service to get at the backing stream.
What's odd to me is that even with this frame buffer approach, why would you not just recompress the video at the edge? You could even do it in Javascript with WebCodecs if that was the layer you were living at. Even semi-expensive compression on a modern CPU is going to be way cheaper than copying raw video frames, even just in terms of CPU instruction throughput vs memory bandwidth with shared memory.
It's easy to cast stones, but this is a weird architecture and making this blog post about the "solution" is even stranger to me.
cogman10 | 0 comments | 4 hours ago
I mean, I would presume that the entire reason they forked chrome was to crowbar open the black box to get at the goodies. Maybe they only did it to get a framebuffer output stream that they could redirect? Seems a bit much.
Their current approach is what I'd think would be a temporary solution while they reverse engineer the streams (or even get partnerships with the likes of MS and others. MS in particular would likely jump at an opportunity to AI something).
> What's odd to me is that even with this frame buffer approach, why would you not just recompress the video at the edge? You could even do it in Javascript with WebCodecs if that was the layer you were living at. Even semi-expensive compression on a modern CPU is going to be way cheaper than copying raw video frames, even just in terms of CPU instruction throughput vs memory bandwidth with shared memory.
Yeah, that was my final comment. Even if I grant that this really is the best way to do things, I can't for the life of me understand why they'd not immediately recompress. Video takes such a huge amount of bandwidth that it's just silly to send around bitmaps.
> It's easy to cast stones, but this is a weird architecture and making this blog post about the "solution" is even stranger to me.
Agreed. Sounds like a company that likely has multiple million dollar savings just lying around.
tbarbugli | 0 comments | 6 hours ago
rozap | 1 comment | 7 hours ago
dylan604 | 1 comment | 6 hours ago
Context matters? As someone working in production/post, we want to keep it uncompressed until the last possible moment. Or at least apply no more compression than it was acquired with.
DrammBA | 1 comment | 6 hours ago
It does, but you just removed all context from their comment and introduced a completely different context (video production/post) for seemingly no reason.
Going back to the original context, which is grabbing a compressed video stream from a headless browser, the correct approach to handle that compressed stream is to leave it compressed until the last possible moment.
pavlov | 0 comments | 5 hours ago
With that constraint, letting a full browser engine decode and composite the participant streams is the only option. And it definitely is an expensive way to do it.
cosmotic | 3 comments | 8 hours ago
pavlov | 1 comment | 7 hours ago
Since they don't have API access to all these platforms, the best they can do to capture the A/V streams is simply to join the meeting in a headless browser on a server, then capture the browser's output and re-encode it.
MrBuddyCasino | 2 comments | 7 hours ago
pavlov | 0 comments | 6 hours ago
To my knowledge, Zoom's web client uses a custom codec delivered inside a WASM blob. How would you capture that video data to forward it to your recording system? How do you decode it later?
Even if the incoming streams are in a standard format, compositing the meeting as a post-processing operation from raw recorded tracks isn't simple. Video call participants have gaps and network issues and layer changes; you can't assume much of anything about the samples as you would with typical video files. (Coincidentally, this is exactly what I'm working on right now at my job.)
moogly | 0 comments | 7 hours ago
ketzo | 0 comments | 8 hours ago
Recall's offering allows you to get "audio, video, transcripts, and metadata" from video calls -- again, total conjecture, but I imagine they do need to decode into raw format in order to split out all these end-products (and then re-encode for a video recording specifically.)
Szpadel | 0 comments | 7 hours ago
a_t48 | 0 comments | 8 hours ago
devit | 0 comments | 4 hours ago
A more reasonable approach would be to have Chromium save the original compressed video to disk, and then use ffmpeg or similar to reencode if needed.
Even better, don't use Chromium at all.
Dylan16807 | 0 comments | 7 hours ago
akira2501 | 3 comments | 7 hours ago
They seem to not understand the fundamentals of what they're working on.
> Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.
You're doing bulk data transfers into a multiplexed short messaging socket. What exactly did you expect?
> However there's no standard interface for transporting data over shared memory.
Yes there is. It's called /dev/shm. You can use shared memory like a filesystem, and no, you should not be worried about user/kernel space overhead at this point. It's the obvious solution to your problem.
> Instead of the typical two-pointers, we have three pointers in our ring buffer:
You can use two back-to-back mmap(2) calls that map the same buffer twice, creating a ring buffer that avoids this.
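Roughly, a sketch of that trick in Rust using the libc crate (illustrative only; the /dev/shm path and sizes are made up, error handling and cleanup are omitted): reserve twice the buffer's worth of address space, then map the same shared-memory file into both halves, so anything that runs past the end simply continues into the second mapping.

    // Sketch: map one /dev/shm file twice, back to back. `size` must be a
    // multiple of the page size.
    fn map_double(size: usize) -> *mut u8 {
        let path = std::ffi::CString::new("/dev/shm/ring-demo").unwrap(); // hypothetical name
        unsafe {
            let fd = libc::open(path.as_ptr(), libc::O_RDWR | libc::O_CREAT, 0o600);
            libc::ftruncate(fd, size as libc::off_t);

            // Reserve 2*size of contiguous address space...
            let base = libc::mmap(
                std::ptr::null_mut(),
                size * 2,
                libc::PROT_NONE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            );
            // ...then overlay the same file into both halves with MAP_FIXED.
            libc::mmap(
                base,
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED | libc::MAP_FIXED,
                fd,
                0,
            );
            libc::mmap(
                (base as *mut u8).add(size) as *mut libc::c_void,
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED | libc::MAP_FIXED,
                fd,
                0,
            );
            base as *mut u8
        }
    }

With that layout a writer can copy a message of any length (up to `size`) at its head offset without ever splitting it, and reader/writer offsets just wrap modulo `size`.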
Scaevolus | 0 comments | 7 hours ago
More than 50 GB/s of memory bandwidth is common nowadays[1], and will basically never be the bottleneck for 1080p encoding. Zero-copy matters when you're doing something exotic, like Netflix pushing dozens of GB/s from a CDN node.
[1]: https://lemire.me/blog/2024/01/18/how-much-memory-bandwidth-...
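Back-of-the-envelope: a raw 1080p frame in a 4:2:0 format like NV12 is about 1920 × 1080 × 1.5 ≈ 3.1 MB, so 30 fps is roughly 93 MB/s per stream; even a few extra copies per frame is a rounding error next to 50+ GB/s of memory bandwidth.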
didip | 0 comments | 6 hours ago
And since it behaves like a filesystem, you can swap it with a real filesystem during testing. Very convenient.
I'm curious whether they already tried this, and if they did, what problems they encountered.
anonymous344 | 0 comments | 7 hours ago
ComputerGuru | 5 comments | 8 hours ago
(I also wouldn't be surprised if they had even more memory copies than they let on, marshalling between the GC-backed JS runtime and the GC-backed Python runtime.)
I was coming back to HN to include in my comment a link to various high-performance IPC libraries, but another commenter already beat me to it by linking to iceoryx2 (though of course they'd need to use a Python extension).
SHM for IPC has been well understood as the better option for high-bandwidth payloads since the 1990s and is a staple of Win32 application development for communication between services (daemons) and clients (GUIs).
diroussel | 0 comments | 7 hours ago
On the outside we can't be sure. But it's possible that they made the right decision to go with a naïve implementation first, then profile, measure, and improve later.
But yes, the whole idea of running a headless web browser to run JavaScript to get access to a video stream is a bit crazy. But I guess that's just the world we are in.
CharlieDigital | 0 comments | 7 hours ago
> I don't mean to be dismissive, but this would have been caught very early on (in the planning stages) by anyone that had/has experience in system-level development rather than full-stack web js/python development
Based on their job listing[0], Recall is using Rust on the backend.
Sesse__ | 0 comments | 7 hours ago
randomdata | 0 comments | 7 hours ago
The product is not a full-stack web application. What makes you think that they brought in people with that kind of experience just for this particular feature?
Especially when they claim that they chose that route because it was what was most convenient. While you might argue that wasn't the right tradeoff, it is a common tradeoff developers of all kinds make. "Make It Work, Make It Right, Make It Fast" has become pervasive in this industry, for better or worse.
whatever1 | 0 comments | 7 hours ago
cperciva | 3 comments | 6 hours ago
Are you sure about that? Atomics are not locks, and not all systems have strong memory ordering.
Sesse__ | 0 comments | 5 hours ago
CodesInChaos | 1 comment | 4 hours ago
Atomics require you to explicitly specify a memory ordering for every operation, so the system's memory ordering doesn't really matter. It's still possible to get it wrong, but it's a lot easier to get right than in (traditional) C.
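For example, in Rust the ordering is spelled out at every call site. A toy single-producer/single-consumer cursor pair (my own illustration, not the article's code) looks something like:

    use std::sync::atomic::{AtomicUsize, Ordering};

    // The producer fills a slot, *then* publishes it with a Release store;
    // the consumer's Acquire load pairs with that store, so it is guaranteed
    // to see the slot's contents.
    struct Cursors {
        head: AtomicUsize, // next slot the producer will write
        tail: AtomicUsize, // next slot the consumer will read
    }

    impl Cursors {
        fn publish(&self, new_head: usize) {
            self.head.store(new_head, Ordering::Release);
        }

        fn readable(&self) -> usize {
            let head = self.head.load(Ordering::Acquire);
            let tail = self.tail.load(Ordering::Relaxed);
            head.wrapping_sub(tail)
        }
    }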
reitzensteinm | 0 comments | 3 hours ago
But yes, it's an order of magnitude easier to get portability right using the C++/Rust memory model than what came before.
jpc0 | 1 comment | 5 hours ago
Pretty sure the ARM and x86 parts you would be seeing on AWS do have strong memory ordering, and have atomic operations that operate on something the size of a single register...
cperciva | 0 comments | 5 hours ago
calibas | 0 comments | 4 hours ago
bauruine | 0 comments | 6 hours ago
CyberDildonics | 0 comments | 7 hours ago
"using WebSockets over loopback was ultimately costing us $1M/year in AWS spend"
then
"and the quest for an efficient high-bandwidth, low-latency IPC"
Shared memory. It has been there for 50 years.
londons_explore | 3 comments | 7 hours ago
And the GPU for rendering...
So they should instead just be hooking into Chromium's GPU process and grabbing the pre-composited tiles from the LayerTreeHostImpl[1] and dealing with those.
[1]: https://source.chromium.org/chromium/chromium/src/+/main:cc/...
orf | 0 comments | 7 hours ago
mbb70 | 0 comments | 7 hours ago
isoprophlex | 1 comment | 7 hours ago
yjftsjthsd-h | 1 comment | 5 hours ago
I dunno, when we're playing with millions of dollars in costs, I hope they're at least regularly evaluating whether they could run some of the workload on GPUs for better perf/$.
londons_explore | 0 comments | 5 hours ago
OptionOfT | 4 comments | 7 hours ago
nemothekid | 0 comments | 7 hours ago
The memcpys were the cost they were paying, even if it was all local.
ted_dunning | 0 comments | 4 hours ago
The basic point is that WebSockets requires that data move across channels that are too general and cause multiple unaligned memory copies. The CPU cost to do the copies was what cost the megabuck, not network transfer costs.
magamanlegends | 0 comments | 7 hours ago
jgauth | 1 comment | 7 hours ago
kunwon1 | 1 comment | 7 hours ago
DrammBA | 0 comments | 6 hours ago
From the article intro before they dive into what exactly is using the CPU.
ahmetozer | 0 comments | 4 hours ago
beoberha | 0 comments | 6 hours ago
dbrower | 0 comments | 7 hours ago
renewiltord | 0 comments | 7 hours ago
cyberax | 1 comment | 5 hours ago
ted_dunning | 0 comments | 4 hours ago
Read the article.
jazzyjackson | 0 comments | 6 hours ago
hipadev23 | 2 comments | 8 hours ago
ted_dunning | 0 comments | 4 hours ago
cynicalsecurity | 0 comments | 7 hours ago
jgalt212 | 1 comment | 7 hours ago
As a point of comparison, how many TB per second of video does Netflix stream?
ffsm8 | 0 comments | 7 hours ago
Netflix has hardware that ISPs can get so they can serve their content without saturating the ISPs' lines.
There is a statistic floating around that Netflix was responsible for 15% of global traffic in 2022/2023, and YouTube 12%. If that number is real... that'd be a lot more.
yapyap | 1 comment | 7 hours ago
that's surprising to... almost no one? 1 TB/s is nothing to scoff at
blibble | 1 comment | 7 hours ago
assuming you're only shuffling bytes around, on bare metal this would be ~20 DDR5 channels worth
or 2 servers (12 channels/server for EPYC)
you can get an awful lot of compute these days for not very much money
(shipping your code to the compressed video instead of the exact opposite would probably make more sense though)
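(Rough math, assuming DDR5-6400: 6400 MT/s × 8 bytes ≈ 51 GB/s per channel, so ~1 TB/s of copies works out to roughly 20 channels, i.e. a couple of 12-channel EPYC sockets.)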
pyrolistical | 1 comment | 5 hours ago
blibble | 0 comments | 2 hours ago
pro-tip: it's quite a bit bigger than a terabit