How does text and image SafeSearch work?
Matt's answer:
How does SafeSearch, both for text and images, work?
I worked on the initial version of SafeSearch for text, so let’s concentrate on that. I don’t want to give away anything that spammers could use, but I can talk about how SafeSearch worked way back in 2000, so you can get an idea. And the idea is roughly what you would expect: we look for certain words, and we give them certain weights. And if you have enough words with enough weight, then we say, “OK, this looks like it might be a porn or porn-related document”.
You can have various thresholds, where you can say – “OK, it might be safe at this level, but unsafe once you get too many”. If it’s a book or if it’s a really long thing and it’s got one word, that’s not quite as bad as if you have just a very small document and you have that same word.
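As a rough illustration of the idea above (weighted words, thresholds, and taking document length into account), here is a minimal sketch. It is not the actual SafeSearch code: the word list, the weights, the threshold values, and the function names are all made up for the example, and dividing by the number of words is just one simple way to make a single word count for less in a long document.

```python
import re

# Hypothetical weights, chosen only for illustration.
WORD_WEIGHTS = {
    "porn": 5.0,
    "xxx": 4.0,
    "amature": 3.0,  # the misspelling carries more weight than the correct spelling
    "sex": 1.0,
    "breast": 0.5,
}

# Hypothetical thresholds on the length-normalized score.
BORDERLINE_THRESHOLD = 0.02
UNSAFE_THRESHOLD = 0.10


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score(text):
    """Sum the weights of all words, divided by the document length in words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = sum(WORD_WEIGHTS.get(token, 0.0) for token in tokens)
    return total / len(tokens)


def classify(text):
    """Map the score onto 'safe', 'borderline', or 'unsafe'."""
    s = score(text)
    if s >= UNSAFE_THRESHOLD:
        return "unsafe"
    if s >= BORDERLINE_THRESHOLD:
        return "borderline"
    return "safe"


# The same word in a long document and in a very short one.
long_document = "word " * 199 + "sex"
short_document = "sex pics"
print(classify(long_document))
print(classify(short_document))
```

With these made-up numbers, one weighted word in a 200-word document stays under the lower threshold, while the same word in a two-word document goes over the upper one. A scheme this simple would also flag a very short page about sex education, which is the kind of case where more care is needed.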
Some words are worse and more likely to be pornographic than other words
Certain slang terms and, it turns out, misspellings tend to carry more weight. So “amateur” misspelled A-M-A-T-U-R-E is much more likely to be “amateur porn” than “amateur radio” or something along those lines. You do have to be careful, because there are words like “breast”, which can be breast cancer, or “sex”, which can be sex education. You do want to use learning to work out which words should carry which weights, and those sorts of things. And it actually is relatively sophisticated; you can imagine doing a lot more than pure content analysis or using just straight words. But at least to a first approximation, that’s a pretty good way to classify something as porn or not.
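One simple way to learn which words should carry which weights is to compare how often each word shows up in documents already labeled as adult versus documents labeled as safe. The sketch below uses a smoothed log-odds ratio for that. This is a standard textbook technique swapped in for illustration, not necessarily what SafeSearch did, and the function names and the tiny training sets are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def learn_weights(adult_docs, safe_docs, smoothing=1.0):
    """Return a weight per word: positive if the word leans adult, negative if it leans safe.

    The weight is log(P(word | adult) / P(word | safe)), with add-one style
    smoothing so that a word seen in only one class does not divide by zero.
    """
    adult_counts = Counter(t for doc in adult_docs for t in tokenize(doc))
    safe_counts = Counter(t for doc in safe_docs for t in tokenize(doc))
    vocabulary = set(adult_counts) | set(safe_counts)

    adult_total = sum(adult_counts.values()) + smoothing * len(vocabulary)
    safe_total = sum(safe_counts.values()) + smoothing * len(vocabulary)

    weights = {}
    for word in vocabulary:
        p_adult = (adult_counts[word] + smoothing) / adult_total
        p_safe = (safe_counts[word] + smoothing) / safe_total
        weights[word] = math.log(p_adult / p_safe)
    return weights


# Tiny hypothetical training sets, only to show the shape of the data.
adult_docs = ["amature pics", "amature video"]
safe_docs = ["amateur radio club", "breast cancer screening", "amateur radio"]

weights = learn_weights(adult_docs, safe_docs)
for word in ("amature", "amateur", "breast"):
    print(word, round(weights[word], 2))
```

On this toy data the misspelling ends up with a positive weight, while the correct spelling and “breast” end up with negative ones, because they only appear in the safe examples. Real training data would be far larger, and a real system would presumably look at more than single words.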
One thing that I wanted to mention: if you think you have been detected as porn when you’re not pornographic, or you think you found a bug or an error with SafeSearch, you can report that and pass that information along. And so people can adjust the algorithms or otherwise make improvements so that we don’t necessarily say that a site that is really, really good is pornographic if it’s not.
You would be surprised at how well some pretty simple scanning with some relatively simple weights can catch a large fraction of the porn on the web. Just a little bit of historical digression here about previous search engines: at least as I remember it, in the early days on AltaVista you could search for “sex” with their family mode on, and they would return only something like 20 results, because they had basically said, “OK, we are only going to allow these results for this query” or “We’re only going to say these results are safe”.
The mental model that Google had was different
We said – “OK, if there’s a mother, she’s searching with her Cub Scout son, would she be surprised, would she be offended by the results?” But at the same time, you’d like to get the comprehensiveness of the web. You’d like to score the entire web and find the documents that are porn and exclude those. But then if there’s something about sex education or things along those lines, you would like those to be returned. So it’s a pretty good approach. It’s worked very well. And thankfully, there’s a much better team of engineers who are much more sophisticated in the ways that they analyze pages now, so all of that original stuff that I wrote back in 2000 I’m sure has been replaced by much better stuff at this point.
by Matt Cutts - Google's Head of Search Quality Team