CANNOT SCALE BIG DATA PROCESSING
Budget €30-250 EUR
I built an ETL pipeline to process terabytes of data. To that end, I set up a Spark cluster (Scala) and a MinIO server for object storage.
Using 10 virtual machines for Spark processing, I can process and save 200 gigabytes in roughly 30 minutes.
The issue is that I cannot scale this processing: if I double the number of Spark virtual machines, the processing time does not change.
I need a Data Architect who has enough expertise to help me identify the bottleneck and fix the issue.
• I use virtual machines set up on-premises using VMware ESXi 6
• Physical machines (which host VMs) are on a 1 GB network.
• There is no overcommitment of vCPU or RAM
• Spark VMs: 16 vCPU, 64 GB RAM
• MinIO (storage): 16 vCPU, 64 GB RAM, configured using RAID 0
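As a rough sanity check on the figures above, the average throughput of a run can be compared with the network capacity. The sketch below is only back-of-the-envelope arithmetic: it assumes the network mentioned in the list is a 1 Gbit/s link and that all data passes through one such link to the MinIO server, neither of which is confirmed. The object name `ThroughputCheck` is hypothetical.

```scala
// Back-of-the-envelope throughput check (a sketch, not a measurement).
object ThroughputCheck {
  // Figures taken from the post.
  val dataGigabytes = 200.0 // data processed and saved per run
  val runMinutes    = 30.0  // approximate wall-clock time per run

  // ASSUMPTION: the network link is 1 gigabit per second.
  val linkGbitPerSec = 1.0

  // Average throughput in gigabytes per second, then in gigabits per second.
  val gbytePerSec: Double = dataGigabytes / (runMinutes * 60.0)
  val gbitPerSec: Double  = gbytePerSec * 8.0

  // Fraction of the assumed link capacity used on average.
  val linkUtilisation: Double = gbitPerSec / linkGbitPerSec

  def main(args: Array[String]): Unit = {
    println(f"Average throughput: ${gbytePerSec * 1000}%.0f MB/s ($gbitPerSec%.2f Gbit/s)")
    println(f"Share of assumed link: ${linkUtilisation * 100}%.0f%%")
  }
}
```

Under these assumptions the average comes to roughly 111 MB/s, or about 0.89 Gbit/s. If the assumptions hold, the link to storage would already be close to full, which would be consistent with extra Spark VMs not reducing the run time. This is a hypothesis to verify with real network and disk measurements, not a diagnosis.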
SOME DETAILS ABOUT DATA PROCESSING
The process is straightforward.
• Read data from 2 sources on MinIO,
• Union the data from the two sources,
• Filter out the rows with empty values in one column of the resulting dataset,
• Apply 2 groupBy operations on that column (we save the intermediate values after the first groupBy),
• Union the dataset obtained after the groupBy operations with the rows that have empty values in that column,
• Save the whole dataset back to MinIO.
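The steps above can be sketched in Spark (Scala) roughly as follows. This is a minimal illustration, not the actual job: the column names `key` and `value`, the Parquet format, the path parameters and the `sum`/`max` aggregations are all hypothetical placeholders, since the post does not specify them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, sum}

object EtlSketch {
  // Hypothetical column names, not taken from the post.
  val keyCol   = "key"
  val valueCol = "value"

  // True for rows whose key is neither null nor an empty string.
  private val hasKey = col(keyCol).isNotNull && col(keyCol) =!= ""

  // Union both sources and keep only rows with a non-empty key.
  def withKey(a: DataFrame, b: DataFrame): DataFrame =
    a.unionByName(b).filter(hasKey)

  // Union both sources and keep only rows with an empty or null key.
  def withoutKey(a: DataFrame, b: DataFrame): DataFrame =
    a.unionByName(b).filter(!hasKey)

  // First groupBy; the sum aggregation is a placeholder.
  def firstGroupBy(df: DataFrame): DataFrame =
    df.groupBy(keyCol).agg(sum(valueCol).as(valueCol))

  // Second groupBy; the max aggregation is a placeholder.
  def secondGroupBy(df: DataFrame): DataFrame =
    df.groupBy(keyCol).agg(max(valueCol).as(valueCol))

  // End-to-end flow; all four paths are supplied by the caller (e.g. s3a:// URIs).
  def run(spark: SparkSession, srcA: String, srcB: String,
          intermediateOut: String, finalOut: String): Unit = {
    val a = spark.read.parquet(srcA) // ASSUMPTION: Parquet input
    val b = spark.read.parquet(srcB)

    val first = firstGroupBy(withKey(a, b))
    first.write.mode("overwrite").parquet(intermediateOut) // save intermediate values

    val result = secondGroupBy(first)
      .unionByName(withoutKey(a, b).select(keyCol, valueCol)) // add back empty-key rows
    result.write.mode("overwrite").parquet(finalOut)
  }
}
```

Keeping each step in its own small function makes it possible to time or test the stages separately, which may help when looking for the step that does not scale.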
5 freelancers are bidding on average €334 for this job
Hi there, how are you? I have gone through your project details. I would like to tell you that I have a great deal of experience in VMware, Spark, Data Engineering, Big Data and Amazon S3. For that I would require from…
Hi Saint Denis, I am a Data Engineer with 7+ years of experience. I would like to offer you help to fix this issue. Please let me know if we can connect.
Hi, I have 10 years of experience in this. I would like to work for you, as I have already done similar tasks and supported many projects and people in the same way. I would like to hear from you. Thank you for…
Hi, I am a data engineer with 5 years of experience. I have designed and built large-scale Spark pipelines for use cases similar to yours. Unfortunately, as you might be aware, there is no straightforward answer to your pro…