Hacker News

Hah, my friend and I did nearly the exact same project in college, though minus the publication. We had an open ended project for an intro to machine learning class we were taking.

We ended up using the Million Song Dataset (I'm not sure Spotify gave out this kind of data six years ago), which includes various info about roughly a million songs: artist, length, and supposedly Echonest API results for things like "dancyness". We then merged this with a list of something like 250k play counts. We then found out the Echonest data was quite literally all just set to null, so I went to their API, signed up for a developer key, and spent six days querying to fill out our dataset.

We were massive novices at machine learning, so we were basically just script-kiddying it, and pretty much none of the models we made over a 24ish-hour period (because we were dumb college students doing things last minute) had any significant accuracy. Finally we made a random forest model that was able to predict, with 80% accuracy, the "magnitude" of plays, i.e. roughly whether a song would get a million plays or a thousand.
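That "magnitude" target can be sketched as a simple order-of-magnitude bucketing of play counts. This is my own illustration, not the original code; the function name and the exact bucketing scheme are assumptions:

```python
def play_count_magnitude(plays: int) -> int:
    """Bucket a raw play count by order of magnitude:
    1_000 -> 3, 1_000_000 -> 6, so the model predicts a
    handful of coarse classes instead of an exact count."""
    return len(str(max(plays, 1))) - 1

# A random forest would then be trained to predict this bucket
# from the song features (artist, length, "dancyness", ...).
```

Predicting the bucket turns a noisy play-count regression into a small classification problem, which is a friendlier target for a quick random forest.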

When we broke it down (model explainability is an awesome feature), we found that out of everything interesting we had done with feature investigation, data cleaning, etc., the model was about 90% based on which artist made the song. In retrospect, that makes sense, in a sort of cynical way: even a great song by an unknown artist rarely makes it big. The moral of the story, I guess, is that machine learning isn't magic.
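That kind of break-down can be illustrated with a tiny permutation-importance toy: score the model, scramble one feature column, and see how far accuracy falls. Everything below (the field names, the deterministic `shuffle_fn` stand-in, the toy model) is my own sketch, not the original project's code; real code would use a random shuffle or a library's built-in feature importances:

```python
def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, col, shuffle_fn):
    """Drop in accuracy when one feature column is permuted.
    A large drop means the model leans heavily on that column."""
    base = accuracy(model, rows, labels)
    permuted = shuffle_fn([r[col] for r in rows])
    permuted_rows = [{**r, col: v} for r, v in zip(rows, permuted)]
    return base - accuracy(model, permuted_rows, labels)

# Toy data where the label is entirely determined by the artist.
rows = [
    {"artist": "A", "tempo": 120}, {"artist": "A", "tempo": 90},
    {"artist": "B", "tempo": 120}, {"artist": "B", "tempo": 90},
]
labels = [1, 1, 0, 0]
model = lambda r: 1 if r["artist"] == "A" else 0

reverse = lambda vals: vals[::-1]  # deterministic stand-in for a shuffle
artist_importance = permutation_importance(model, rows, labels, "artist", reverse)
tempo_importance = permutation_importance(model, rows, labels, "tempo", reverse)
# artist_importance -> 1.0 (model collapses), tempo_importance -> 0.0 (ignored)
```

Scrambling the artist column destroys the toy model's accuracy while scrambling tempo changes nothing, which is the same shape of result described above.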

I still have all the data, and I've been meaning to revisit it now that I actually have a better understanding of the field. It's on my list of things to revisit/do; a very long list.



Ha. I did this exact same thing for a project in college using Echonest and linear regression. In the end, we were unable to find a single statistically significant coefficient. We ended up having to change our project completely. Kudos to your team for finding something there.


I also did something similar in college, but due to the similar issues noted, pivoted to genre classification with extracted audio features. With that, though, I was actually able to get a pretty accurate classifier going.


First experiment I would do is remove the artist field from the input :)


I also wanted to do this for the "eventual" revisit. We also wanted to try more computationally intensive training approaches, possibly including neural networks, but my poor 4th-gen i5 just could not cope. I'm waiting for AMD-accelerated training to be mostly trivial, so possibly forever.


And maybe eliminate the top 100 artists?


"90% based on which artist made the song"

Doesn't that demonstrate that the actual business case (producers discovering new artists) doesn't even factor into what the model does, which is predicting which new songs will actually be hits?

It seems like the former is much harder than the latter in this case.


I think what you will find is that the process of an artist being discovered is basically the same as getting into YC.

It's about the artist and whether they have star quality and are saleable. There are plenty of songwriters to write the actual songs.

Classic case is someone like Sia Furler who has written a ton of hits for other artists.

https://time.com/4209769/sia-best-songs-written-for-other-ar...


Keep in mind that our results and findings, while making "sense" post hoc, are predicated on our process and procedure being correct, which, honestly, as complete newbies to the whole field, was roughly a coin flip.


Not only that, but sales figures and marketing spend are also not included.



