
I do this stuff for my day job, and everything here rings true.

Every part of the "build and use ML in production" workflow is horrible (unless maybe you work at Google).

Firstly, the data science workflow is NOT the same as software engineering. Version control tools don't work properly at any level: git on Jupyter notebooks doesn't work without hacks, versioning data is horrible, and versioning models is horrible.
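To make the "hacks" concrete: the usual workaround is to strip outputs and execution counts before committing, so diffs only show code and markdown changes (tools like nbstripout and jupytext automate this as a git filter). A minimal sketch in plain Python, assuming the standard .ipynb JSON layout:

```python
import json

def strip_outputs(nb: dict) -> dict:
    """Clear outputs and execution counts so git diffs only show real edits."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Example: a minimal notebook with one executed code cell.
nb = {
    "cells": [
        {"cell_type": "code", "execution_count": 3,
         "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result",
                      "data": {"text/plain": ["2"]}}]},
        {"cell_type": "markdown", "source": ["# Notes"]},
    ],
    "nbformat": 4, "nbformat_minor": 5,
}
clean = strip_outputs(nb)
print(json.dumps(clean["cells"][0]["outputs"]))  # prints "[]"
```

Even with this, merge conflicts inside notebook JSON remain painful, which is part of why notebooks and git stay at odds.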

Deployment is horrible. SageMaker (and its equivalents) provides the very base level of functionality needed, but is so separated from the feature engineering side that everyone ends up doing vast amounts of work to get something useful.

Frameworks are horrible. TensorFlow did the upgrade to TF2, so half the examples on the web don't work anymore. The TF data loading abstractions are great if you work at Google, but so complicated that getting even basic examples going is a slog.

PyTorch has a horrible deployment story.

Every other framework is either an experimental research project where progress takes months (JAX) or so far behind modern work that it's useless (MXNet).

But the thing is, dealing with all these issues is worth it, because ultimately it does actually work.



I think that notebooks have been a bit of a sideways step in a data science workflow. They're accessible, but they're tremendously brittle.


Notebooks are so much better than anything else I've used for data science.

I have a traditional SWEng background and came into data science never having used them. I'd never go back.

I'm not saying that they are impossible to improve, but as a general approach they are exactly right.

They are "brittle" when viewed as a software artefact. But that's not really what they are (or should be).


Not to be too self-promotion-y, but I work on an open source ML deployment platform that we built specifically because of how incongruous the data science workflow is with software engineering: https://github.com/cortexlabs/cortex


I'm curious to get your input on the Model Monitor, Debugger, and Experiments features on Amazon SageMaker. Have you had a chance to play around with them?


I've tried Experiments. It's great at the easy part of the ML workflow: optimising a working model. But it doesn't really help with the hard part - the debugging at the interface of the model and the data.

Say you are building a car detector or something. Building the CNN is ML101, and SageMaker experiments helps with optimising the training parameters to get the best out of the model.

But that's not really the hard part. The hard part is working out that your model is failing on cars with reflections of people in the windscreen, or that your dataset's co-ordinate space is "negative = up", so your in-memory data augmentations are teaching the model upside-down cars.
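That second failure mode is easy to reproduce. A toy sketch (hypothetical `vflip_box` helper, pure Python, no real CV library) of a vertical-flip augmentation that silently mangles labels when the dataset stores y as "negative = up":

```python
# Image convention: y = 0 at the top of the image, y grows downward.
# Some datasets instead store "negative = up", i.e. y grows upward.

H = 100  # image height in pixels (assumed)

def vflip_box(y_top: float, y_bottom: float, height: float):
    """Vertical flip of a box, assuming the image convention (y grows downward)."""
    return height - y_bottom, height - y_top

# A car occupying rows 10..30 near the top of the image:
print(vflip_box(10, 30, H))    # (70, 90): correctly moved near the bottom

# The same physical car labelled in a "negative = up" dataset would be
# stored with y in [-30, -10]. Feeding that straight into the flip:
print(vflip_box(-30, -10, H))  # (110, 130): entirely outside the image
```

Nothing crashes; the model just trains against boxes that no longer cover the car, and the only symptom is a loss that won't improve. That's exactly the class of bug that parameter-tuning tools like Experiments don't surface.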

I don't know what Debugger gives me over a notebook, but I've only read the blog post.

I haven't tried Model Monitor but I do think that could be useful.


Any experience with ML in MATLAB?


I'll chime in: I did my thesis in MATLAB (specifically ML for MRI). While MATLAB itself wasn't the most fun to work with, they have honestly built a fantastic suite of tools. For example, the Classifier App is amazing for brute-forcing through a bunch of approaches.

Even went through a couple of hackathons with it and got some SoTA results.

I wouldn't ever go back to it, especially outside of academia. But it's not the worst thing out there.


Yes.

I mean it's fine, but I don't see any reason to use it instead of Python, and lots of reasons not to. But I'm not a mathematician by training.

I do quite like RStudio, though, and I do see places where that is useful. So maybe MATLAB fits somewhere in between: less stats than R, less programming than Python.


I am a mathematician by training (+CS), I've worked in Matlab quite a bit, I've taught hundreds of students in it (having no say in language, but we're about to switch to Python), and I probably see even more reasons not to use it than you do. (I haven't done ML in it, but I'd be astounded if it wasn't terrible compared to Python and its ecosystem.)

I recommend avoiding Matlab for every use case unless they've got you trapped with a huge existing code base or reliance on a proprietary toolbox.


IMO, whenever RStudio decides to support Jupyter notebooks, it's game over for everyone else. It's such a great piece of software for data analysis, and I hope they continue to go broader than just the R language.


What do you think of Julia?


> Julia

The Haskell of Machine Learning.

https://www.jwz.org/doc/worse-is-better.html



