Wikipedia Used for Artificial Intelligence
eldavojohn writes "It may be no surprise, but Wikipedia is now being used in the field of artificial intelligence. The applications may be endless. For instance, spam fighting is a tough front, and it looks as though researchers are now turning to an ontology- or taxonomy-based solution to fight spammers. The concept is also at the forefront of artificial intelligence research, both for progress toward an application that passes the Turing Test and for creating semantically aware applications. The article comments on uses of Wikipedia in this manner: '"... spam filters block all messages containing the word 'vitamin,' but fail to block messages containing the word B12. If the program never saw B12 before, it's just a word without any meaning. But you would know it's a vitamin," Markovitch said. "With our methodology, however, the computer will use its Wikipedia-based knowledge base to infer that 'B12' is strongly associated with the concept of vitamins, and will correctly identify the message as spam," he added.'"
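To make the quoted B12 example concrete, here is a minimal sketch of concept-based filtering. It is not the researchers' method: TOY_KB is a hypothetical, hand-written stand-in for a Wikipedia-derived knowledge base, and the association strengths, the BLOCKED_CONCEPTS set, and the 0.5 threshold are all made-up assumptions.

```python
import re

# Hypothetical, hand-written stand-in for a Wikipedia-derived knowledge base:
# each term maps to concepts with an assumed association strength in [0, 1].
TOY_KB = {
    "b12": {"vitamin": 0.9},
    "niacin": {"vitamin": 0.8},
    "wing": {"aircraft": 0.7},
}

# Concepts this toy filter treats as spam topics (an assumption).
BLOCKED_CONCEPTS = {"vitamin"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def concept_scores(text, kb=TOY_KB):
    """Return the strongest association seen for each concept in the text."""
    scores = {}
    for token in tokenize(text):
        # A token names its own concept with full strength.
        candidates = {token: 1.0}
        candidates.update(kb.get(token, {}))
        for concept, strength in candidates.items():
            scores[concept] = max(scores.get(concept, 0.0), strength)
    return scores


def looks_like_spam(text, threshold=0.5):
    """Flag text whose association with any blocked concept meets the threshold."""
    scores = concept_scores(text)
    return any(scores.get(c, 0.0) >= threshold for c in BLOCKED_CONCEPTS)
```

A keyword-only filter would catch "vitamin" but miss "B12"; in this sketch both are flagged because "b12" is linked to the "vitamin" concept in the toy knowledge base.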
Wikipedia needs work for spam filtering.... (Score:2, Insightful)
Re:Wikipedia needs work for spam filtering.... (Score:5, Insightful)
uh oh, there goes wikipedia (Score:4, Interesting)
Re:uh oh, there goes wikipedia (Score:5, Insightful)
Re:uh oh, there goes wikipedia (Score:5, Interesting)
Um, so? That doesn't make it inappropriate to block traffic from places where the overwhelming majority of the packets are toxic. It's a system-by-system, admin-by-admin judgement call, but there's no question that Korea isn't doing nearly enough to stop this problem locally. If the local culture starts to realize that they're isolating themselves from large sections of the internet because they won't do something to prevent 99% of their outbound mail from being spam, then maybe the need to filter will also go away.
And what about people with business connections in China or Korea?
I have a lot of customers with contacts like that. All of them (their Asian contacts) use Yahoo, Gmail, and similar accounts specifically to avoid this problem. Businesses in China and Korea are totally aware that most ISPs in those areas have poisoned outbound SMTP relays and user desktops. Or, they host their western-facing mail servers with providers in the west - I see a lot of that, too, since many of those businesses have two separate messaging platforms for the different international audiences with whom they communicate.
Re: (Score:2, Insightful)
Re: (Score:2)
Yup, good point. Which is why the same thing seems to be true to/from, say... Romania, etc., also.
Re: (Score:2)
By turning the INTERnet into the HINDERnet, your effort will eventually make the Internet useless. You therefore destroy what you're trying to facilitate use of. Not clever.
Re: (Score:2)
You're missing the point. When the packets from entire Class B address ranges are, by empirical testing, almost entirely crap, the people who own those addresses have already broken their little corner of the internet. Preserving the non-poisoned portion of the wider network isn't "destroying the village to save it," it's just sort of li
Re: (Score:2)
You definitely do destroy not only the village but a connected community of villages with your solution. What should be happening is bringing pressure to bear against those who have had the address space allocated to them, then moving up the supply chain. Ult
Re: (Score:2)
You're working too hard at this. The sound walls are an undesirable but nevertheless somewhat effective treatment of the symptom of a larger problem. The analogy is apt.
What should be happening is bringing pressure to bear against those who have had
Re: (Score:2)
Re: (Score:2)
I'm not suggesting you block a nation. I'm suggesting you strike a deal with someone else in that country to provide the same addresses, on pain of losing them if they can't con
Re: (Score:2)
also, having looked at enough email headers from spammers, while they may originate from some of those countries you mentioned, i notice many use accounts like Yahoo and gmail from U.S. servers, which shoots your whole theory down.
But, it's not a theory. I'm talking about what I actually see in logs and message queues, especially on rece
Re: (Score:2, Informative)
Re: (Score:3, Insightful)
In my own research I've looked at the problem of AI knowledgebase contamination and know that unless a truth validation system is employed, it is all too easy to condemn the poor AI to reasoning with flawed data. And it's very difficult to design a good validation mec
Re: (Score:2)
Re: (Score:2)
I sure as hell hope that this approach fails miserably, because I can guarantee you that the next development will be the bot-based modification of all articles in the Wikipedia. There might be some development after that of captcha interstitials before posting or modifying anything, combined with some attempt at developing a m
Re: (Score:2)
What this argument boils down to is "I don't want computers to get smarter because I don't like some of the applications." Of course there's
Re: (Score:2)
Err, no. I have no idea where you got this idea from. What I actually don't like is weak attempts at improving the intelligence of computers. Furthermore, I like even less weak attempts at improving the intelligence of computers whose direct and inevitable consequence is the corruption of an incredibly useful resource, which in turn will lead to the corruption of the AI - the initial go
Re:uh oh, there goes wikipedia (Score:5, Interesting)
"South Korea, Indonesia, and especially Nigeria, etc"
While we're at it, why not block Alberta, California, North Carolina, Virginia, Colorado, Oklahoma, Kansas, Vermont, New Hampshire, Massachusetts, Spain, France and Portugal - all spam hotspots according to the map cited? What's that, you receive email from people in these places? Tough titties, if we're to block email coming from spam hotspots as you say.
Also, you've managed to point a finger of blame at Indonesia and Nigeria who are saintly in comparison to some more developed nations. Go racism!
Re: (Score:2)
Since you were modded 'interesting', I did exactly as you said and found this page: http://mailinator.com/mailinator/map.html [mailinator.com]. Refreshed it 3 times now, and every time at least 4 balloons are pointing at the US, one at Canada and 2 or 3 at European countries. Interesting indeed.
Re: (Score:2)
A rough assessment of the last 30 days spam stored on my server suggests more than 75% comes from the USA.
A quick look at http://www.mailinator.com/mailinator/map.html [mailinator.com] shows clusters in the south (Memphis seems to be a hotspot) and on the east coast.
I don't know about Korea, but blocking Tennessee, Missouri and Florida would cut my spam in half. Blocking the rest of the USA would reduce it by 75%.
Nothing new here... (Score:5, Funny)
For years, Slashdot posts have used wikipedia as a form of artificial intelligence.
Mine Slashdot headlines (Score:2)
Comment removed (Score:3, Insightful)
Re: (Score:2)
Re: (Score:2)
WikiTuring Test (Score:2)
Re:WikiTuring Test (Score:4, Funny)
I recently got a quite funny attempt like that, pumping some stock in an image attachment (which, moreover, looked like a captcha in order to avoid OCR). The title of the spam, however, was "cocaine inexcusable", and the body, well (just two sample quotes -- and yes, the first two sentences appeared together like that):
Needless to say, it triggered the Bayesian filter pretty heavily in spite of all the obfuscation attempts :)
Re: (Score:2)
Want to see where their spider got this stuff?
The safe for children crap [bionictonic.co.uk] (since reworded)
The Intimate Intruder Anal Probe [bionictonic.co.uk]
The Wrist Rocket [bionictonic.co.uk]
Re: (Score:2)
Unfortunately, humans make these sorts of semantic errors all the time. We're just extending a Bayesian filter to make a statem
i prefer (Score:5, Funny)
I think it would be much more effective if we used a taxidermy-based solution to fight spammers.
Cool solution to yesterday's problem (Score:2)
Of course, the minute anti-spam software/services use OCR is the minute that spam images start looking like captchas.
Re: (Score:2)
Artificial intelligence! (Score:4, Informative)
Whenever someone claims that a program is semantically aware, be sure to reread Clay Shirky's article [shirky.com] on the Semantic web.
Future trends... (Score:3, Interesting)
Uhh (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Interesting)
What would impress me is an AI that filtered spam very effectively, but also noticed that Aunt Sally had a new email address and continued to deliver her mail.
I dinna think it means what the AI thinks it means (Score:2)
UMMMM wordnet? (Score:4, Informative)
http://wordnet.princeton.edu/ [princeton.edu]
and according to my source of AI, wikipedia http://en.wikipedia.org/wiki/WordNet [wikipedia.org]
(like all sophisticated software) has been in development since the mid eighties..
WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing
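As a rough illustration of the synset idea described above, the sketch below uses a tiny hand-built table. It is not real WordNet data and does not use any WordNet API; the synset ids, the words, and the single "hypernym" (is-a) link per synset are simplifying assumptions made for the example.

```python
# Hand-built miniature in the spirit of WordNet; not real WordNet data.
# Each synset groups synonymous words and points to one broader synset.
SYNSETS = {
    "car.n": {"words": {"car", "automobile"}, "hypernym": "vehicle.n"},
    "truck.n": {"words": {"truck", "lorry"}, "hypernym": "vehicle.n"},
    "vehicle.n": {"words": {"vehicle"}, "hypernym": "artifact.n"},
    "artifact.n": {"words": {"artifact"}, "hypernym": None},
    "dog.n": {"words": {"dog"}, "hypernym": "animal.n"},
    "animal.n": {"words": {"animal"}, "hypernym": None},
}


def synset_of(word):
    """Return the id of the first synset containing the word, or None."""
    for synset_id, entry in SYNSETS.items():
        if word in entry["words"]:
            return synset_id
    return None


def hypernym_chain(synset_id):
    """List the synset and all of its ancestors, nearest first."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return chain


def are_synonyms(a, b):
    """Two words are synonyms here if they share a synset."""
    sa, sb = synset_of(a), synset_of(b)
    return sa is not None and sa == sb


def lowest_common_hypernym(a, b):
    """Return the nearest shared ancestor synset of two words, or None."""
    sa, sb = synset_of(a), synset_of(b)
    if sa is None or sb is None:
        return None
    ancestors_b = set(hypernym_chain(sb))
    for synset_id in hypernym_chain(sa):
        if synset_id in ancestors_b:
            return synset_id
    return None
```

Navigating such links is what lets a program decide that "car" and "lorry" are related (both are vehicles) while "car" and "dog" share no ancestor in this toy table.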
Re: (Score:2, Interesting)
You must not have a very good imagination. Wikipedia articles are far larger than wordnet definitions, with much more potential to hold useful information. Wikipedia has a much larger scope than wordnet, including huge amounts of cultural, historical, and scientific data that wordnet ignores. Wikipedia has a larger team of contributors. Wikipedia has data in several other languages besides English. Wikipedia is constantly updated with
Since when (Score:4, Insightful)
Re: (Score:2)
Re:Since when (Score:5, Informative)
Re: (Score:3, Insightful)
The creative part?
Re: (Score:3, Interesting)
Re: (Score:3, Informative)
Paraphrasing to make a point: What part of computing is not detecting, storing, and applying patterns and relations?
To be meaningful, "AI" should denote more than (as the article summary indicates is being done) doing a grep through a web repository to deduce associations. There are branches of AI founded on brain neurology (neural nets), evolution (Genetic Algorithms), Bayesian logic, and various oth
Re: (Score:2)
The target hasn't moved (Score:2)
Re: (Score:2)
A red herring comment modded +5 Insightful? *Shakes head*
The keyword is part of intelligence. For instance, storing data is only a part of the "ability" called intelligence. By your logic anyone who is capable of storing is capable of artificial intelligence. However, the system advertised in this "article" has only parts of artificial intelligence. And those parts are considered rather trivial in CS.
S
Re: (Score:2)
Re: (Score:3, Interesting)
A second definition of intelligence comes from "Mainstream Science on Intelligence", which was signed by 52 intelligence researchers in 1994:
Re: (Score:2)
Re: (Score:2)
Personally, I'd say it's the ability to solve problems WITHOUT having been designed to solve those problems, and the ability to see opportunities for improvement in the current way of doing things.
Cats live in our homes and foxes roam in our cities; neither of those animals was designed for those environments, nor have they had time for significant biological evolution, yet they find ways to manage in those environments.
And we have in a couple of centuries gone
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Just make spam a crime! (Score:4, Insightful)
Re: (Score:2)
Just because a problem is not having an obvious and overt effect on you personally doesn't erase your knowledge that something exists. Administrators are having a problem, they're telling you with their actions. If there was no spam there'd be no spam filters, if it wasn't getting worse they wouldn't need better ones. You cl
For true AI, you need 3D spatial recognition (Score:2)
Re: (Score:2)
how about pen1s en1argement? (Score:2)
associations... (Score:2)
I don't see how this is getting us anywhere except moving closer to having a spam filter that just returns "true" to anything that isn't white-listed.
Looks like good research (Score:3, Informative)
BTW, associating, clustering, etc. documents using single word statistics is computationally cheap and easy - it is also associating short word sequences that makes this a difficult problem.
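The cost difference between single-word statistics and word-sequence statistics can be sketched as follows. This is only an illustration under simple assumptions: whitespace tokenization, and a worst-case bound that treats every combination of vocabulary words as a possible sequence. The function names are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-token sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in whitespace-tokenized, lowercased text."""
    return Counter(ngrams(text.lower().split(), n))


def worst_case_features(vocabulary_size, n):
    """Upper bound on distinct n-grams over a vocabulary of the given size."""
    return vocabulary_size ** n
```

With a 1,000-word vocabulary, single words give at most 1,000 features, while two-word sequences give up to 1,000,000, which is one reason sequence statistics need far more data and memory than single-word counts.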
Re: (Score:2)
Not very "intelligent" (Score:5, Insightful)
Not New, not newsworthy (Score:3, Informative)
The field of word sense exploration is one of the more mature areas of NLP; take a look at Princeton's WordNet database for an example [http://wordnet.princeton.edu/]. Using their word sense database (without referring to silly words such as "ontology") it has been possible - for years - to discover whether two lemmas (that's "words" to you) are related in a particular way, or not related. Using WordNet it is possible to distinguish between antonyms and homonyms, thereby thwarting spammers who use words which sound like "viagra" - "niagra" and words which have opposite meanings.
Re: (Score:2)
Anybody who has been working in the field of NLP (natural language processing) can do little more than sneer at this story.
Along with the title, that is one of the most useless comments one finds in /.
It is news to many of us —the great majority of readers I dare say— because we are nerds that come from different fields. I bet I could come up with common knowledge from cellular telephony that you haven't heard about and it would be news to you. If it was sufficiently interesting, it would even be newsworthy even if it's been kicked around base stations for 4 years.
You make it sound like you have deeper knowledg
Re: (Score:2)
I do think you may have been a bit harsh on grandparent; I for one, having done some work in NLP, was wondering whether anyone else was really questioning the newsworthiness of the post. So you can,
Make the people accountable (Score:2)
If the spam originated from a botnet in his machine, make him accountable too.
If he has installed the latest updates from Microsoft and still the botnet could get in, then it is not an issue. But, if he has not taken the effort to download the patches for say, the last 6 months, and a botnet operated from his machine, causing discomfiture to all and sundry, then he is accountable for i
Look up Abstraction Physics (Score:2)
considering the article is from physorg......
and to think they plan to patent it? Abstraction Physics?
I don't think so...
Perhaps this is all that we were missing for AI (Score:2)
Re: (Score:2)
Hutter Prize (Score:3, Informative)
The rigor of the benchmark is the key. The Turing Test really only benchmarks human mimicry -- not intelligence per se. The new theoretic basis of universal intelligence [hutter1.net] allows a mathematically rigorous approach to AI that is reviving the field after nearly 50 years of drifting in a stagnant pool of inadequate concepts.
But spammers can add content to Wikipedia (Score:2)
If Wikipedia content is used to determine whether a message is spam, suddenly there is a direct incentive to spammers to add spam-related content to Wikipedia.
Re: (Score:2)
This would be very bad indeed for Wikipedia because it gives a motive to vandals - and not just to the stupid vandals we have right now - but to the annoyingly inventive ones too.
Urgh!
The double-edged sword that is knowledge (Score:2)
Personally, I think spammers are already much smarter than this. It may be my imagination, but if so it's surely coming, that spammers are grabbing text from places they harvest my name and just including that text in messages rather than trying to make up things from scratch. S
As I've Said Many Times Before (Score:2)
For example, what if I'm getting information sent to me from acquaintances about life extension - references to vitamins and nutrients would abound. But it wouldn't be spam.
An AI spam blocker has to know what I'm interested in, what material I've received before that was cleared, AND has to be able to, in some sense, UNDERSTAND the content rather than just correlating it to other terms atomically in terms of frequency of occurrence. Otherwise,
PBEM (Score:2)
Ha Ha! Blocked!
You didn't sink my battleship!
Text of IJCAI paper (Score:3, Informative)
While IJCAI is a prestigious conference, and the results may be sound, the claims as to the applicability to spam filtering are bogus. The paraphrase of how state-of-the-art filters work is wrong, and there's no evidence that better word associations translate to better spam filter accuracy. None at all.
Should the authors wish to show applicability to spam filtering, they should do so using the TREC Spam Track methodology and datasets. http://trec.nist.gov/data/spam.html [nist.gov]
The call for participation in TREC 2007 is currently open: http://trec.nist.gov/call07.html [nist.gov] Nothing at all prevents a TREC participant from submitting a filter that includes a copy of Wikipedia, if they feel it would help.
Who needs AI? (Score:2)
I have had my GMail account for what, two years or so, and I really don't think google's spamfilter has ever missed a beat. That is to say that all the real spam I receive every day (~40 to 100 spams depending on the day) ends up in the spam folder, not my inbox. Spam is a total non-issue for me. OTOH, my hotmail inbox is so atrocious and the spamfilter so bad that I can't use the account for anything important. I don't know what kind of black magic
Skynet (Score:2)
intelligence, artificial or otherwise? (Score:2)
Re: (Score:2)
Comment removed (Score:4, Interesting)
Re: (Score:2)
That is not entirely correct. Bayesian filters work with *all* textual tokens in a message, not just the visible text in the body of the message. e.g. if your image spam all have various combinations of debora@somerandomdomain in the mail headers as a recent spambot was doing or if your spam all used the same relays and consequently has the same Received: headers, then a Bayesian filter will still rank it higher than non-spam.
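The point about headers can be illustrated with a toy naive Bayes scorer that tokenizes the whole raw message rather than only the visible body. This is a sketch under stated assumptions, not how any particular filter is implemented: the TinyBayes class, the tokenizer regex, the add-one smoothing, and the example addresses and relay names are all hypothetical, and class priors are left out for brevity.

```python
import math
import re
from collections import Counter


def tokens(raw_message):
    """Tokenize the whole raw message, headers included."""
    return re.findall(r"[a-z0-9._-]+", raw_message.lower())


class TinyBayes:
    """Toy per-token scorer; ignores class priors for brevity."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, raw_message, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(tokens(raw_message)))
        self.messages[label] += 1

    def _p(self, token, label):
        # Add-one smoothed fraction of messages with this label containing the token.
        return (self.counts[label][token] + 1) / (self.messages[label] + 2)

    def spam_log_odds(self, raw_message):
        """Sum per-token log-likelihood ratios; positive means spam-leaning."""
        return sum(
            math.log(self._p(t, "spam") / self._p(t, "ham"))
            for t in set(tokens(raw_message))
        )
```

Trained on image spam that shares a relay name and a sender local part, the scorer leans toward spam for a new message with the same headers even though the body carries no readable text.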
Re: (Score:2)
Re: (Score:2)
OCR unnecessary (Score:2)
Bayesian filters (and other statistical filters colloquially known as Bayesian) ca
Re: (Score:2)
I'm using Thunderbird 1.5.0.9, and it seems to work great on those "book attack" spams. I haven't seen one get through yet, so they appear to be less likely to get through than normal spams.
On a guess, I'd say that a random chunk of literature is far more likely to contain words never used in valid correspond
Re: (Score:2)
I don't usually respond to ACs, but this particular belief is common enough that I feel I should say a few words. The overall goal of spam abatement is to enhance the probability that legitimate email will be delivered in a timely and efficient manner to its intended recipient. Content-based filtering is widely deployed in this context and it is fairly effective for its intended purpose. Demonstrably more effective, and less intrusive, than for
Uhm... what color is the sky in your world? (Score:2)
College students these days are often heard to say, "I have an email address but I never use it." They prefer their cell phones because voice and SMS text messages are not yet flooded with spam. Email may not be dead,
Re: (Score:2)
There's no evidence that the statement above is true. A user who has to wade through a mixture of spam and non-spam will overlook some of the non-spam. The question is whether the human or the machine will overlook more. A subsidiary question is, once overlooked, how likely is the message to be retrieved using some subsidiary
Re:The B12 example is horrible (Score:4, Informative)
Then your e-mail account's Bayes map would have the map (word B12 -> folder Aircraft) with a high probability, which would outweigh (word B12 -> article Vitamin -> folder Drug Spam).
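The argument above can be sketched as a comparison between a directly learned mapping and an indirect, encyclopedia-derived one. All numbers here are illustrative assumptions, as are the table names and the choice to multiply strengths along the indirect path; nothing in the article or the thread specifies how the two kinds of evidence would actually be combined.

```python
# Illustrative numbers only; none come from the article or the thread.
DIRECT = {"b12": {"Aircraft": 0.95}}          # learned from the user's own mail
WORD_TO_ARTICLE = {"b12": {"Vitamin": 0.9}}   # encyclopedia-derived association
ARTICLE_TO_FOLDER = {"Vitamin": {"Drug Spam": 0.8}}


def folder_scores(word, user_has_history=True):
    """Score folders for a word from direct and indirect evidence."""
    scores = {}
    if user_has_history:
        for folder, p in DIRECT.get(word, {}).items():
            scores[folder] = max(scores.get(folder, 0.0), p)
    # Indirect path: multiply strengths along word -> article -> folder.
    for article, p_article in WORD_TO_ARTICLE.get(word, {}).items():
        for folder, p_folder in ARTICLE_TO_FOLDER.get(article, {}).items():
            p = p_article * p_folder
            scores[folder] = max(scores.get(folder, 0.0), p)
    return scores


def best_folder(word, user_has_history=True):
    """Return the highest-scoring folder for a word, or None if there is none."""
    scores = folder_scores(word, user_has_history)
    return max(scores, key=scores.get) if scores else None
```

For a user who already files "B12" mail under Aircraft, the direct mapping wins; for a user with no such history, only the indirect Vitamin path remains and the message lands in Drug Spam.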
Re: (Score:2)
Re: (Score:2)
Unless you live in Qatar [slashdot.org]. Or more practically for residents of countries with an anglophonic majority, unless you live in an area where both the local cable company and the local DSL company have policies that you consider unreasonable.
Re: (Score:2)
You mean "smarthosting" through an e-mail provider in North America or Europe, right? Otherwise, your cable or DSL connection is on the "dynamic IP" list as well as a "spam haven country" list.