Feb 6, 2024
Actually, its limitation is more around disk space. You can store the source datasets as Parquet files on S3, process them with DuckDB, and write the training data to local disk. With enough disk space you can process a high volume of data on a single node, and since all the major clouds offer high-memory instances, you can get pretty far with 'just a single DuckDB node'.
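For illustration, here's a minimal sketch of that workflow using DuckDB's Python API. The bucket name, column names, and file paths are placeholders I made up for the example, not anything specific:

```python
import duckdb

con = duckdb.connect("workspace.duckdb")

# Enable reading directly from S3 (httpfs extension ships with DuckDB).
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # credentials via env vars or SET s3_access_key_id / s3_secret_access_key

# Cap memory and point spill files at a big local volume, so queries
# larger than RAM can still finish as long as there's disk space.
con.execute("SET memory_limit = '200GB';")
con.execute("SET temp_directory = '/mnt/scratch/duckdb_tmp';")

# Scan the source Parquet files on S3 without downloading them first,
# do the processing in SQL, and write the result to local disk.
con.execute("""
    COPY (
        SELECT user_id, feature_a, feature_b
        FROM read_parquet('s3://my-bucket/source-data/*.parquet')
        WHERE feature_a IS NOT NULL
    )
    TO 'training_data.parquet' (FORMAT PARQUET);
""")
```

The nice part of this setup is that memory is only a soft constraint: DuckDB can spill intermediate results to the temp directory, so the node's disk (plus S3 for the sources) is what really bounds how much data you can push through.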