Hacker News: _dcwr's comments

Are you aware of the pitfall I've mentioned here?: https://news.ycombinator.com/item?id=17323068

Just making the possible passwords 10 chars long, drawn from a set of 62 (lowercase + uppercase + digits), gives a precision of about 0.05 (only ~1 in 20 positives is a true positive) at a 0.00000001 (0.000001%) FP rate.
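In code, that works out roughly like this (a sketch; the list size is the haveibeenpwned count used later in the thread, and the FP rate is the one quoted above):

```python
# Precision (fraction of filter hits that are genuinely leaked passwords),
# assuming ~500M leaked passwords and a Bloom filter FP rate of 1e-8.
bad = 501_636_842            # passwords on the haveibeenpwned list
all_10char = 62 ** 10        # every 10-char password over [a-zA-Z0-9]
fp_rate = 0.00000001

precision = bad / (bad + fp_rate * all_10char)
print(precision)             # ~0.056: only ~1 in 20 positives is genuine
```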

Edit:

(This is a reply to w8rbt below, posted here since HN has decided I cannot post anymore for the next who-knows-how-many hours.)

I'm aware of this and wouldn't treat a Bloom filter as the final decision maker when there is a hit, but many people seem to act like it is intended to be one and bring up its FP rate as if it were low enough, e.g.:

1. This comment and several of its replies: https://news.ycombinator.com/item?id=17322692

2. This comment telling me false positives don't matter: https://news.ycombinator.com/item?id=17323438

3. Neither of the linked Bloom filters mentions that on a hit you should confirm with the https://haveibeenpwned.com/ API.

4. (Of course, this is HN after all!) My comments pointing out this paradox/pitfall/whatever are sitting at negative scores. I've literally lost karma for saying "hey, a 1 in 1000 FP rate isn't actually super accurate, and a hit is far from an almost-sure true hit". And there are one or two green-named users who seem to be criticizing this Bloom filter from freshly made accounts, exactly because of this BS.


Read the original Bloom filter paper from 1970. It covers all of this and it's very short. If the test for membership returns 'probably', then you do the more expensive test if you need to know for certain, or you don't use one to begin with:

"Space/Time Trade-offs in Hash Coding with Allowable Errors", Burton H. Bloom

https://dl.acm.org/citation.cfm?doid=362686.362692
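That check-then-confirm pattern can be sketched with a toy filter (not a production implementation; the sizes, hash scheme, and the tiny stand-in password list are all made up for illustration):

```python
import hashlib

# A Bloom filter answers "definitely not" or "probably yes"; on
# "probably yes" you fall back to the expensive exact check.
M, K = 2 ** 20, 7  # bit count and hash count, picked arbitrarily

def _bits(item):
    # Derive K bit positions from SHA-256 of "<i>:<item>".
    for i in range(K):
        h = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

class Bloom:
    def __init__(self, items):
        self.bits = bytearray(M // 8)
        for item in items:
            for b in _bits(item):
                self.bits[b // 8] |= 1 << (b % 8)

    def maybe_contains(self, item):
        return all(self.bits[b // 8] & (1 << (b % 8)) for b in _bits(item))

leaked = {"password", "123456", "qwerty"}  # stand-in for the real list
bloom = Bloom(leaked)

def is_leaked(pw):
    if not bloom.maybe_contains(pw):
        return False     # definite negative, no lookup needed
    return pw in leaked  # expensive exact check (e.g. the HIBP API)

print(is_leaked("123456"))         # True
print(is_leaked("correct horse"))  # False (exact check catches any collision)
```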


Given how little 500 million passwords is out of the entire set of all possible passwords (even restricting to a single short length), the 1 in 1000 false positive rate is actually extremely high.

See my other comment: https://news.ycombinator.com/item?id=17323068


Does 1 in 1000 false positives mean that 1 in every 1000 passwords not on the list returns true? If so, that's not good, because good passwords vastly outnumber the bad ones, so even a small false positive rate makes a positive very, very unlikely to be a true one.

For example: let's assume all passwords (including all of the ones on the haveibeenpwned list) are exactly 8 chars long and drawn from a set of 50 characters.

This gives the filter a big head start, since in reality the possible passwords outnumber the ones on the list far more crushingly: many possible lengths, unbounded length, and 95 printable ASCII chars to choose from (alphanumerics alone are already 62 chars).

  passlen = 8
  passchars = 50
  badpasses = 501636842              # size of the haveibeenpwned list
  allpasses = passchars ** passlen   # all possible 8-char passwords
  falsepos = 0.001                   # Bloom filter FP rate
  print(badpasses / (badpasses + falsepos * allpasses))
The above script prints 0.012679079642336052, which means that (in this extremely forgiving scenario, where the possible passwords are limited to next to nothing compared to real life) only just over one in a hundred positives is a true positive.

This apparently comes up with FP rates in cancer screening, antivirus software and such (most people don't have cancer, most executables aren't viruses, etc.).

I might be wrong, feel free to point that out. I'm not a statistician or a doctor, I just had a statistics class as part of my CS degree. :)

Edit: I've found the English terms and explanations:

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

https://en.wikipedia.org/wiki/False_positive_paradox


The right way to think about the base rate is the chance that a random user is going to try a bad password. You made the base assumption that the user tries a random 8-char password, which means the base rate of a bad password is 1 in 100,000. With a FP rate of 1 in 1,000 this means that 99/100 positives are false ones. In other words, if you were right, every 100,000 users would only trip the filter 100 times, and of those, 99 would be false positives.

But actually, probably more like 1 in 10 or 1 in 50 users use bad passwords. If we use 1 in 50, we get that 19/20 positives are true positives and only 1/20 are false positives. With 100,000 users we'd expect to trip the filter 2,000 times, only 100 of which are false positives.
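A rough arithmetic check of the two scenarios above (a sketch; the base rates are the assumptions stated in this thread, not measured numbers):

```python
# 100,000 users, Bloom filter FP rate of 1/1000, assuming the filter
# catches every genuinely bad password (no false negatives).
users = 100_000
fp_rate = 1 / 1000

# Scenario A: users pick random 8-char passwords, so roughly 1 in
# 100,000 happens to be on the list.
bad_a = users * (1 / 100_000)        # ~1 true positive
false_a = (users - bad_a) * fp_rate  # ~100 false positives
print(false_a / (bad_a + false_a))   # ~0.99: 99/100 positives are false

# Scenario B: 1 in 50 users reuses a leaked password.
bad_b = users * (1 / 50)             # 2,000 true positives
false_b = (users - bad_b) * fp_rate  # ~98 false positives
print(bad_b / (bad_b + false_b))     # ~0.95: roughly 19/20 positives are true
```

Same filter, same FP rate; only the base rate changed, and the meaning of a hit flipped completely.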


Yep, in other words the prior probability of a user trying to use a 'bad' password will be a lot higher than badpasses/allpasses. The whole point of checking against a list of compromised passwords is because people reuse their passwords.


What you are missing is that the cost of a false negative is over 1000x the cost of a false positive.

The number of false positives doesn't matter, and getting a false positive matters nearly not at all. 1 in 1000 good passwords is rejected. Good passwords are nearly random, and cheap to generate. That means the average person needs to re-pick a password once every, what, 100 years?


One in thousand means the user is very unlikely to run into this event.

"But it's much more likely than running into a true positive" you might say.

Actually not, since the user runs into a true positive exactly when they choose a very common password. It's not random. If they only choose bad passwords, they'll run into true positives every single time.

On the other hand, that there exist more false positives than true ones is actually great, since this means you can't recreate the true set of passwords given the Bloom filter.


That's true if you're sampling your passwords uniformly from all legal passwords, and that's not an accurate assumption.

That being said, false positives will hopefully outnumber true positives, since not many people should be using one of the leaked passwords.

Instead of a Bloom filter, I would use a cuckoo filter. Since its error rate only plateaus when it's close to full, you could get the same error rate with a significantly smaller filter.


Is it really a problem, though? Having a high false-positive-to-real-positive ratio isn't as bad for this application, since a user can just pick a slightly different password, with no significant negative outcome.


PSA for anyone playing around with that haveibeenpwned file: the text list is 30 GB after un-7zipping, but every line is padded with many trailing spaces (up to 61 chars I think, making all lines equal in length) and the line endings are the Windows ones (\r\n instead of just \n). With the trailing spaces and \rs stripped, the file is 'just' 21 GB and fits onto a RAM disk on a 32 GB machine with no swapping.
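Stripping the padding can be done streaming, so you never hold the 30 GB in memory (a sketch; the filenames here are hypothetical stand-ins, not the real export's names):

```python
def strip_padding(src_path, dst_path):
    # Stream-copy, dropping trailing spaces and the CR from every line.
    # rstrip() also eats the newline itself, so we add a plain \n back.
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(line.rstrip() + "\n")

# Demo on a tiny stand-in file (the real one is ~30 GB):
with open("sample.txt", "w", newline="") as f:
    f.write("ABC123:5        \r\nDEF456:2   \r\n")
strip_padding("sample.txt", "clean.txt")
print(open("clean.txt", "rb").read())  # b'ABC123:5\nDEF456:2\n'
```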

This bit me because, to save space and fit the file onto a RAM disk (to save time), I converted it to binary, and when I converted it back it had a different hash despite being 'exactly the same'. Only later did I realize it uses Windows line endings and trailing-space padding (and that I didn't even need to bother with binary, since 21 GB comfortably fits on a RAM disk on a machine with 32 GB RAM, while 30 GB didn't with a browser, Windows, etc. running).

Troy said it's a quirk of the export and might be fixed in the next version of the file, but that it doesn't matter for the compressed 7z size (which is true: I checked personally, with 7z going at it for hours, and it really did compress those line endings and spaces down to nothing).

Edit: Also see my other comment about how overly eager at rejecting passwords this filter might be (unless I've made an error in my reasoning that someone has since pointed out): https://news.ycombinator.com/item?id=17323068


These are not 'rules' of any kind. The haveibeenpwned list was made from password lists that were already leaked in previous hacks, and its author criticizes the 'now add a number', 'now add a symbol', 'now add a capital' rules that result in 'mypassword1_A'.


1. It just does 'return Session()', which is easy enough to find out[1].

2. It doesn't really matter; maybe it's so they're kept close together since they get modified. Session merges the headers you provide per call with its own (yours take precedence), and it still handles cookies even when you provide your own headers. Session also has the benefit of connection pooling, so it's quicker for doing more than one request[2] (the plain get, post, etc. functions in the requests module all go through the request function, which constructs a Session for that single request).

3. What's wrong there? It's just a default argument. Strings aren't mutable, so it avoids the mutable-default pitfall[4]. Is the " quote a problem here? That's a matter of taste/style. PEP 8 is silent on it[3] and just says to pick a convention, and there seems to be one here. Some people (me too) also use single quotes for non-human-readable strings and double quotes for human-readable strings.

4. If you mean here[5], then there is a len check just above it to catch the 'expected' error (a missing element); only the .text part is unchecked. As for the general lack of checks: I don't put them into my GreaseMonkey or one-off Python scraping code either. The site layout is an invariant of a given version of a scraper script, so if some field is missing, the website layout must have changed and the entire script probably needs to be reworked (or the field is simply not always present, so the script is useless for that particular case anyway), and it might as well crash (or, if someone else uses it, they can catch the exception). Either way (crash or catch), when something you expected to surely be there is missing, the results are not coming or might be wrong. The code as it stands anticipates that the element might be absent, but assumes that if it's there, it has the expected field. If the site has been observed to always work like that (the element might be missing, but when present it surely has that field), then the script guards against the first expected possibility (missing element) but not the second (missing field), since if how the site is laid out or how data is stored in elements changed significantly, the script needs changes anyway or risks producing bad or incomplete output (arguably a worse default than a loud failure, though it also depends on what you're doing and what the scrape is for).

I'd assume most users and programmers would rather get an error than have the script return an empty list (despite there being content up there) just because the layout changed. The only other solutions (besides returning a wrong result by design and hiding the errors, or logging them somewhere no one cares to read anyway) would be to catch such exceptions somewhere high up and pack them into a new exception with more information (what URL, the exact element content, the entire response text, etc.), which is probably too much work for such a one-off script, OR to throw your own exceptions from somewhere low. But that's vanity: Python exceptions already point very clearly to where they were thrown and in what call stack, so it's just as clear what broke without adding lots of checks yourself only to throw an "element X is missing field Y that should always be there" message.
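The header-merging behavior from point 2 can be seen without any network access (a sketch; it assumes the third-party requests library is installed, and the URL and header values are made-up placeholders):

```python
import requests

# Per-call headers are merged with the Session's own headers, and the
# per-call ones take precedence. Preparing the request is enough to
# observe the merge; nothing is actually sent.
s = requests.Session()
s.headers.update({"User-Agent": "my-scraper/1.0", "Accept": "text/html"})

req = requests.Request("GET", "https://example.com",
                       headers={"Accept": "application/json"})
prepped = s.prepare_request(req)

print(prepped.headers["User-Agent"])  # my-scraper/1.0 (from the Session)
print(prepped.headers["Accept"])      # application/json (the call wins)
```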

[1] - https://github.com/requests/requests/blob/master/requests/se...

[2] - http://docs.python-requests.org/en/master/user/advanced/#ses...

[3] - https://www.python.org/dev/peps/pep-0008/#string-quotes

[4] - http://www.effbot.org/zone/default-values.htm

[5] - https://github.com/tducret/amazon-scraper-python/blob/master...


Thanks for humoring my post with a real answer.

1. I never realized there was a function that just returned an instance of the class. Should've just looked it up myself.

2. I was wrong and misread the header stuff.

3. There's nothing wrong with it. It's just a convention I'm not accustomed to seeing or using. Admittedly, there are lots of ways to skin a cat with optional and default args.

4. Yeah I understand what you're saying. I guess it's a fine "greasemonkey" approach. Just more susceptible to DOM changes and code errors than I'm comfy with even for a rudimentary implementation.

I agree you'd usually rather get an error than an empty list, but I think it's a little more complicated than that. Sometimes you want the error but don't get it, which is why I tend to write more code around checking things and catching exceptions. The best example: you might not see an index error, but the list item returned for your index isn't actually what you wanted, because the DOM changed or you wrote code against one page that broke on another one you thought would be identical.


It's spelled Ad Nauseam, and you shouldn't throw a commonly used Latin term around without specifying that you're actually talking about an extension[0].

[0] - https://adnauseam.io/


take a breath man


The segmenting backfires hilariously between countries.

I've often seen shows go "It was Mr. X all along!", everyone looks at the speaker in shock, awe and realization, there's a sound cue, the screen fades to black... and then the show continues with them going to kick Mr. X's teeth in, but already in a much less dramatic tone.

And then the actual break happens in another spot, where it's much less appropriate and breaks the flow completely instead of making me eagerly await the end of the break to see Mr. X get his teeth kicked in.

This is actually a trope: http://tvtropes.org/pmwiki/pmwiki.php/Main/CommercialBreakCl...


It's quite normal for there to be two sets of rules. Or maybe it's about the image of not being a "weaboo shit store" (as many people consider anything in anime style).

Even YouTube, famous for being trigger-happy with demonetization and bans, has stuff like "Robin Thicke - Blurred Lines (Unrated Version)", in which three topless women walk around without covering themselves at all.


Valve can state 2 + 2 = 5 tomorrow, that doesn't make it true.

They already have geoblocks in place for several reasons, ranging from their own whims to copyright to legal requirements. I can't buy certain anime[0]; I couldn't buy Skyrim for a few years (my country's physical box retailer got an exclusive deal); you couldn't get Hotline Miami 2 in Australia (their censor banned it); and you're forced to use your own currency (or the next best thing, like dollars or euros, but there is no free choice) when buying games.

They also banned Hatred back in the day for violating their ToS, until they said it didn't actually violate it (because there was a shitstorm over it). And vague ToS that anyone can be found to have broken or not are a staple of online businesses by now.

They also have no problem selling Witcher or Mass Effect games that have far more sexual content, far more realistic and detailed. In just Witcher 1 you have something like 30 sex cards, several topless women, topless bloodied vampire girls, prostitutes, very crude references to sex, and straight-up softcore porn scenes with full animation. But ten-odd naked anime girls in HuniePop is too far and a big legal problem for them?

They also had no problem breaking consumer protection regulations, sniffing your DNS cache and iterating over your running processes to find cheats, fighting in Australian court over their lack of a return policy, and taking zero responsibility for anything they sell for years (Skyrim paid mods, Greenlight trash, asset flips, scams and stolen games, etc.), but now hosting a couple of games with anime boobs is a big no-no that opens them up to big legal problems?

All this content was also explicitly stated not to be a problem until now, up to and including reassuring developers about their specific games. Do Valve and Steam really operate their multi-billion-dollar global business in a "shoot first, ask if it's legal later" way?

All this is, is deciding not to bother with "weaboo shit" (as many people consider anything in anime style) coming through Steam Direct anymore, for whatever reason (money? image?), and knowing they can do that without backlash because it's niche enough. There is no legal reason (unless countries now specifically ban static images of naked anime girls while allowing 3D softcore mocap scenes of realistic models) or technical reason they couldn't make it work.

[0] - https://imgur.com/a/2Fbm9Bl

