
I would like to know about more sophisticated techniques for outlier detection. The ones shown here are Stats 101 level. Z-scores? You can get into a lot of trouble by assuming a normal distribution.
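To make the z-score concern concrete, here is a minimal sketch comparing the classic z-score with the median/MAD "modified z-score". The sample readings are made up, and the 0.6745 constant and 3.5 cutoff follow the commonly cited Iglewicz and Hoaglin convention; treat both as assumptions rather than a recommendation.

```python
import statistics

def z_scores(values):
    """Classic z-score: distance from the mean in sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def modified_z_scores(values):
    """Robust variant: distance from the median, scaled by the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical readings: nine ordinary values plus two extreme ones.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95, 100]

# The two extreme values inflate the mean and standard deviation so much
# that neither of them reaches |z| > 3 (the "masking" effect).
classic_flags = [v for v, z in zip(readings, z_scores(readings)) if abs(z) > 3]

# The median and MAD barely move, so both extreme values stand out.
robust_flags = [v for v, z in zip(readings, modified_z_scores(readings)) if abs(z) > 3.5]

print("classic:", classic_flags)
print("robust: ", robust_flags)
```

On this toy data the classic rule flags nothing, while the robust rule flags only the two extreme readings.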

What are credit card companies doing? What's the best way to combine multiple variables that are predictive of an event for outlier detection? Is there a simple framework to automate reporting of these events in real-time?

One way to model this might be to treat the outcome ("Is this event an outlier?") as a 0/1 variable and use one of the many ways to model that type of data: random forests, logistic regression, neural networks, etc. The problem is that this isn't really "outlier" detection anymore.
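A rough sketch of the 0/1 framing, using a hand-rolled logistic regression so it runs without any libraries. The two features (a scaled amount and a "foreign" flag), the labels, and the learning-rate settings are all invented for illustration; a real system would use labeled history and a proper library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, weights, bias):
    """Estimated probability that event x belongs to class 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical labeled events: [scaled_amount, is_foreign], label 1 = flagged.
rows = [[0.2, 0], [0.3, 0], [0.1, 0], [0.4, 0], [0.5, 0],
        [2.5, 1], [3.0, 1], [2.8, 1], [3.5, 1]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_logistic(rows, labels)
p_ordinary = predict_proba([0.2, 0], weights, bias)
p_suspicious = predict_proba([3.0, 1], weights, bias)
print(round(p_ordinary, 3), round(p_suspicious, 3))
```

Note that this only works if someone has already labeled past events, which is exactly why it stops being "outlier" detection in the unsupervised sense.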



Anomaly detection differs a lot across domains. For CC fraud / risk, you have discrete transactions, so the problem is one of classification and is generally approached with supervised learning.

I don't know what you mean by combining multiple variables. Do you mean analysis methods that work with multiple variables (instead of one-dimensional z-scores), or methods that combine multiple variables into one to reduce input dimensions (e.g. principal component analysis)?
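For the first reading (methods that work on several variables at once), one common generalization of the z-score is the Mahalanobis distance, which accounts for correlation between variables. The sketch below handles only two variables so the covariance matrix can be inverted by hand. The data points are made up, and this is one possible choice of method, not something the thread prescribes.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse covariance, written out for 2x2.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical data: x and y normally move together; the last point has an
# unremarkable x and an unremarkable y, but the combination is unusual.
points = [(1, 1.1), (2, 2.0), (3, 2.9), (4, 4.2), (5, 5.1),
          (6, 5.9), (7, 7.0), (8, 8.1), (9, 9.0), (3, 8.0)]

distances = mahalanobis_2d(points)
most_unusual = max(range(len(points)), key=lambda i: distances[i])
print(points[most_unusual], round(distances[most_unusual], 2))
```

A per-variable z-score would not single out the last point, since both of its coordinates sit well inside the range of the others; the joint distance does.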

Because data and data reporting platforms are so different across companies, there's no 'simple framework' to do reporting. You probably want something like https://github.com/etsy/skyline.

You also describe an ensemble method for outlier detection, which is what Skyline uses. I want to note that there is no reason to say ensembles are "not" outlier detection.
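As an illustration of the ensemble idea, here is a small sketch in which several simple detectors each vote on the latest point of a series, and the point is reported only when enough of them agree. The detectors, thresholds, and the `consensus` parameter are assumptions chosen for the example; this is not Skyline's code.

```python
import statistics

def stddev_vote(series):
    """Latest value is more than 3 standard deviations from the history mean."""
    history, latest = series[:-1], series[-1]
    sd = statistics.stdev(history)
    return sd > 0 and abs(latest - statistics.fmean(history)) > 3 * sd

def mad_vote(series):
    """Latest value is more than 6 MADs from the history median."""
    history, latest = series[:-1], series[-1]
    med = statistics.median(history)
    mad = statistics.median([abs(v - med) for v in history])
    return mad > 0 and abs(latest - med) > 6 * mad

def iqr_vote(series):
    """Latest value falls outside the 1.5 * IQR fences of the history."""
    history, latest = series[:-1], series[-1]
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return latest < q1 - 1.5 * iqr or latest > q3 + 1.5 * iqr

DETECTORS = [stddev_vote, mad_vote, iqr_vote]

def is_anomalous(series, consensus=2):
    """Flag the latest point when at least `consensus` detectors agree."""
    votes = sum(1 for detector in DETECTORS if detector(series))
    return votes >= consensus

# Hypothetical metric: a steady series, then the same series ending in a spike.
steady = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11]
spiked = steady[:-1] + [40]

print(is_anomalous(steady), is_anomalous(spiked))
```

Requiring agreement trades some sensitivity for fewer false alarms, since any single detector's quirks are less likely to trigger a report on their own.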


I meant using multiple variables to categorize outlier events. The techniques shown here also categorize discrete events ("Does this day cross some threshold?"). That is why I guessed at supervised learning methods.



