1 min read · Jan 30, 2024


Doing some back-of-the-envelope calculations, you're talking about 10GB of generated data per second, where each transaction has a payload of about 1-12KB (1MM-10MM transactions per second).
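
As a quick sanity check of that arithmetic (a minimal sketch; the figures are just the rough numbers above, not measurements):

```python
# Quick sanity check of the numbers above (rough figures, not measurements).
GB = 10**9
KB = 10**3

bytes_per_second = 10 * GB           # ~10GB generated per second
for payload in (1 * KB, 12 * KB):    # ~1-12KB per transaction
    tps = bytes_per_second / payload
    print(f"{payload // KB:>2}KB payload -> ~{tps / 1e6:.1f}M transactions/sec")

# One minute of traffic at this rate:
print(f"one minute of data: ~{bytes_per_second * 60 / GB:.0f}GB")
```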

Ultimately it's a question of how fast you need to consume the processed data and what kind of transformations you need.

If I were trying to figure this out, I would take a sample of the data (i.e., one minute ≈ 600GB) and build a processing pipeline around it.

Things to look for:

- how long it takes to process a minute's worth of data (see the timing sketch after this list)

- what's a good compromise between machine spec and cost, i.e., what it will cost to process one minute of data (note: you might need more than one machine, depending on your latency requirements)

- you can reduce costs if your latency requirements are relaxed; if data can be delayed, you can rely on any number of strategies: use spot instances and retry failed jobs (see the retry sketch below), catch up during periods with fewer transactions (e.g., at night or on weekends), sample the data, etc.
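
Here's a minimal sketch of the timing/cost estimate from the first two points; `process_batch`, the sample path, and the hourly price are placeholders to substitute with your own pipeline and your cloud provider's actual pricing:

```python
import math
import time

INSTANCE_COST_PER_HOUR = 2.50  # hypothetical on-demand price, USD

def process_batch(path: str) -> None:
    """Placeholder for your actual transformation pipeline."""

start = time.monotonic()
process_batch("sample_1min.parquet")  # ~600GB sample: one minute of traffic
elapsed = time.monotonic() - start

cost = elapsed / 3600 * INSTANCE_COST_PER_HOUR
machines = max(1, math.ceil(elapsed / 60))  # machines needed to keep up in real time
print(f"processed 1 minute of data in {elapsed:.0f}s (~${cost:.2f}); "
      f"~{machines} machine(s) of this spec to keep up in real time")
```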
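
And a minimal sketch of the spot-instance retry strategy from the last point; `run_on_spot_instance` is a stand-in for however you dispatch work, and the ~20% interruption rate is made up for the demo:

```python
import random
import time

class SpotReclaimedError(Exception):
    """Raised when the spot instance is interrupted before the job finishes."""

def run_on_spot_instance(chunk_id: int) -> None:
    # Placeholder: pretend ~20% of runs lose their spot instance mid-flight.
    if random.random() < 0.2:
        raise SpotReclaimedError()
    print(f"chunk {chunk_id} processed")

def process_with_retries(chunk_id: int, max_attempts: int = 5) -> None:
    # Blind retries are only safe if each job is idempotent.
    for attempt in range(1, max_attempts + 1):
        try:
            run_on_spot_instance(chunk_id)
            return
        except SpotReclaimedError:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"chunk {chunk_id} failed after {max_attempts} attempts")

for chunk in range(3):
    process_with_retries(chunk)
```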

Hope this helps!
