Hello,
I am a data scientist and R/Python programmer.
I read the file you linked and got quite interested in this project.
I am not sure of the purpose of these calculations, but they are essentially text analysis. The part with the duplicates and the nested loops is where the real work is: it is very easy if we disregard efficiency, but if efficiency matters, we need an algorithm that makes a single pass over the whole file, scanning for duplicates as it goes (see the sketch below).
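To illustrate the efficiency point, here is a minimal sketch of the single-pass approach I have in mind, written in Python. It assumes "duplicate" means a line that has already appeared earlier in the file; the file name, the line-based definition of a duplicate, and the function name are placeholders until I know more about your data.

```python
# Minimal sketch of a single-pass duplicate scan.
# A naive nested-loop comparison is O(n^2); keeping a set of
# already-seen lines brings the scan down to roughly O(n).

def find_duplicate_lines(path):
    seen = set()
    duplicates = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            text = line.rstrip("\n")
            if text in seen:
                # Record each repeat occurrence with its line number.
                duplicates.append((line_number, text))
            else:
                seen.add(text)
    return duplicates

# Hypothetical usage -- "input.txt" stands in for your actual file:
# for line_number, text in find_duplicate_lines("input.txt"):
#     print(line_number, text)
```

The exact definition of a duplicate (whole lines, words, records, case-insensitive, etc.) would change the details, but the same single-pass idea applies.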
The rest of the calculations are easier, but could still have some tricky parts.
Overall, I look forward to your reply so we can discuss further details.
Thank you