Data Analysis, Software Engineering Principles (TM), and Religion
An absolutely perfect take on how to write data science code
This is a sermon I have prepared on the correct way to write software that runs data analysis. It is a religious document and a wicked hot take. If you ignore this, your single-regression project will take months to complete, and you will pointlessly waste time.
Firstly, “Object-oriented programming”, “Functional programming”, etc. The fuck. You are writing a regression. You do not need a “Feature” object. You need to get the data, do various tests, and run OLS. If anyone says the word “factory”, it’s straight to the gulags. Making something an object does not make it more “flexible” or easy to change. The thing most likely to change is the model and the only thing the program really does is run the model so, please, just estimate the model.
Secondly, no data analysis program is long-lived. It will not change; it will die. No one will need it anymore because there’s a new way to address the problem or you will be effectively re-writing the entire program and just leaving in some data-fetching/writing part. The relevant data science problems fundamentally change as the business changes. There is no way around this. There is a way to waste months of development time to design a “flexible system”.
Thirdly, your program is 1000 lines long. Having to change something later will result in such little overhead that you should not even consider it. You are writing a glorified script. If it takes you two extra months to write your flexible, extensible regression program, it will take forever for that investment to actually pay off because it’s just not that much more expensive to modify a plain, normal imperative script at this length.
You may be instructed to instead write “well-designed, maintainable, reusable” code adhering to “best practices”. Note that there are essentially no metrics on this. People rarely quantify what “maintainable” means in the discussion or even exactly what maintenance would be difficult to do with a boring imperative script. Or who exactly would re-use the code. Or that writing the code twice is even an expensive thing to do relative to the cost of writing some abstraction. It’s just asserted.
The correct response to, “But is this program/library flexible enough? What if we want to-” is then I will just write another script. Seriously. This one weird trick massively increases data science productivity. People massively understate the costs of designing a “flexible system” and massively overstate the costs of just copy-pasting some code into a new script/library and starting fresh.