MetaOptimize - Latest Comments

Re: PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API

George London — Fri, 04 May 2012 16:09:00 -0000

Thanks for providing this. Makes jumping into PyLucene way easier.

Re: KEA Keyphrase Extraction as an XML-RPC service (code release)

MAX — Tue, 29 Nov 2011 18:54:50 -0000

Hey, it seems that when I set up the XML-RPC service, the KEA stopwords functionality stops running (it works if I run it from the command line). Any idea about this? Thanks!

Re: Discussion 2.0: Personalization

Brian Vandenberg — Thu, 07 Jul 2011 11:01:44 -0000

It would be nice if something like this could become a reality, but the sad truth of it is that Stack Exchange-like sites thrive precisely because there's a carrot dangled in front of people's faces: you will become more respected, and people will care more about what you say, if your 'karma' score or your 'reputation' score is higher.

This has been proven time and time again in many fields. My favorite example is the story of Virginia Apgar (creator of the Apgar score): prior to the publication of the Apgar score, the rate of surviving live births was abysmally low, but as soon as doctors had something to measure themselves by, the survival rate suddenly sky-rocketed -- if for no other reason than because they had something to brag about while playing golf.

Re: Discussion 2.0: Personalization

Oliver Mitevski — Tue, 31 May 2011 08:08:26 -0000

It disturbed me that the TED talk about "filter bubbles" immensely overemphasized the flaws of personalization on the web. Nevertheless it's a tough problem.
One idea could reuse the Aardvark idea. In my personalized profile, the system could serve me questions that I could possibly answer or discussions I could join, based on my current answering record. This means actively asking me to answer or join, for example by sending me an email once a week. This way it would be more engaging for professionals/experts/scientists, because they would be served provoking high-level questions and discussions, while the easier questions would be served to less experienced users, so they have a low entry barrier and find it more engaging. Occasionally the system has to explore in order to exploit more effectively (reinforcement learning). Basically, the experts would be served some newbie questions to test them, and the newbies would be asked to participate in some more serious discussions. Also, people have various behavior patterns. Some have interests in various topics but cover them superficially; others have only one but cover it in full depth. It is better to compute expertise with respect to a topic. Anyway, more mining of profile behaviors could help.
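
The explore/exploit trade-off mentioned above can be sketched with an epsilon-greedy rule. This is only an illustration under stated assumptions: the function name, the difficulty scores, and the expertise estimate are all hypothetical, and the comment does not say which reinforcement learning method it has in mind.

```python
import random


def pick_question(user_level, questions, epsilon=0.1, rng=random):
    """Choose a question id for one user (all names are hypothetical).

    questions: list of (question_id, difficulty) pairs, difficulty in [0, 1].
    user_level: the user's estimated expertise in [0, 1].
    With probability epsilon the system explores and serves a uniformly
    random question; otherwise it exploits and serves the question whose
    difficulty is closest to the user's estimated level.
    """
    if not questions:
        raise ValueError("no questions to choose from")
    if rng.random() < epsilon:
        # Explore: an expert may get a newbie question, and vice versa.
        return rng.choice(questions)[0]
    # Exploit: best match between difficulty and estimated expertise.
    return min(questions, key=lambda q: abs(q[1] - user_level))[0]
```

With epsilon set to 0 the rule always serves the closest match; raising epsilon sends more off-level questions, which is what lets the system revise its estimate of a user's expertise.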

Re: Discussion 2.0: Personalization

Olivier Breuleux — Mon, 30 May 2011 23:05:42 -0000

Sorry, I thought these would make paragraphs :(

Re: Discussion 2.0: Personalization

Olivier Breuleux — Mon, 30 May 2011 23:04:07 -0000

My immediate concern, too, was that personalization would lead to an "I only see what I like" kind of feed. I see now that this is not what you meant, but then I'm confused about what exactly you do mean. If personalization can show me discussions about topics I like, that would indeed be useful. However, I am not sure about filtering the type of discourse: from your post I understand that personalization might skew somebody's feed towards either a flame-war style of discussion or towards a more reasoned academic style. Since arguments from both sides of any issue are considerably worse in the former case than the latter, I am afraid this might lead to an impoverishment of discourse, with facts, evidence and good arguments failing to reach a large segment of the population. To complicate matters even more, there is sometimes a divide where heated discussions lean on one side and reasoned discussions lean on the other. In my opinion, what a great discussion system needs is first and foremost a way to promote facts and sink misinformation. I would also like hate speech and ad hominem to be penalized for everyone. While some people might enjoy hate speech, I do not think society benefits from giving them what they want. A great discussion system needs to improve people, rather than simply fit their biases, and even though I know the latter is not what you intend, it seems very difficult to avoid it. The global effects of a dynamic system can be very difficult to predict, and by trying to make a personalized experience, you might inadvertently end up distributing less quality to fewer people - I suspect this is an attractor of personalization when it is applied to debate, though you might disagree.

Re: Discussion 2.0: Personalization

Joseph Turian — Mon, 30 May 2011 14:02:11 -0000

To quote my response to Mike Altarriba:

"I believe this is the most common misconception about my
proposal. I'm going to address this in an upcoming post. But the short
response is: Personalization of discussion doesn't filter to give you a
warm fuzzy feeling in your tummy that everything is right in the world.
Personalization of discussion gives you stimulating discussion, which
often-times means opposing viewpoints. It should also tie in discussions
on adjacent topics, to give a broader perspective on the context,
assuming you are the sort of person who doesn't like to have blinders
on."

Re: Discussion 2.0: Personalization

gorbachev — Mon, 30 May 2011 13:51:52 -0000

I don't know about this.

Personalization of comment streams can also lead to an "I only see what I like" type of stream.

Re: Discussion 2.0: Personalization

Joseph Turian — Mon, 30 May 2011 13:45:25 -0000

To respond to your objections:

"audience matters" Agreed. Ideally, personalization would automatically find the right community for you, and filter out the trolls.

"text communication makes misunderstanding and anti-social behavior easy" People will learn quickly how to modulate their tone if the system buries their comments or if angry writing leads to stupid responses.

"if individuals have their input filtered based on their own
subject matter / viewpoint preferences, this will drive their world view
such that their positions and beliefs will become even more polarized" Okay, so I believe this is the most common misconception about my proposal. I'm going to address this in an upcoming post. But the short response is: Personalization of discussion doesn't filter to give you a warm fuzzy feeling in your tummy that everything is right in the world. Personalization of discussion gives you stimulating discussion, which often-times means opposing viewpoints. It should also tie in discussions on adjacent topics, to give a broader perspective on the context, assuming you are the sort of person who doesn't like to have blinders on.

"the fact that my choices mean that I don't see a particular poster or
post doesn't change the fact that said poster / posts are still there,
and still affecting the character of the online community." This is an interesting argument about ambient discussion. I'd like to explore it more. Can you give some examples? It seems to me that if something is indirectly stimulating discussion that is relevant to me, I should get shown it.

Re: Discussion 2.0: Personalization

Alexandre Passos — Mon, 30 May 2011 13:19:53 -0000

Mike, I think there is a mistake in your criticism of personalization. You point out that "the fact that my choices mean that *I* don't see a particular poster or post doesn't change the fact that said poster / posts are still there, and still affecting the character of the online community."

This, however, assumes that there is such a thing as _the_ character of the online community. If ultra-personalization as in this post is done, each person will experience one part of a continuum of different communities. So, ideally, only troll-interested and troll-like people would ever see the trolls, and hence they would harm no one.

Re: Discussion 2.0: Personalization

Mike Altarriba — Wed, 25 May 2011 11:52:06 -0000

I've spent a lot of time participating in online discussions, going all the way back to the electronic bulletin board days of the early 1980s, and there are some things I've noticed:

* audience matters - Usenet (the global text-based news and discussion system which preceded the World Wide Web) was, at one time, inhabited almost exclusively by academia, students, some businesspeople, and other professionals. Every Fall, a new crop of students would start showing up, and would have to be acculturated as to the 'netiquette' of the Usenet world. Then, in September 1993, AOL (a public ISP) added direct access to Usenet from their system... and the "Eternal September" began, so named because now Usenet received not a trickle of new individuals once a year, but a continuous flood... and the Signal to Noise ratio dropped precipitously, and stayed that way.

* text communication makes misunderstanding and anti-social behavior *easy*, and circumvents the social regulatory mechanisms we have when it comes to face to face interaction

* if individuals have their input filtered based on their own subject matter / viewpoint preferences, this will drive their world view such that their positions and beliefs will become even more polarized, while simultaneously giving them the false sense that they hold views which are shared by, and supported by the majority

I agree that "The core value of a discussion system is to encourage stimulating and engaging discussion."

I don't, however, think that personalization will help, because it does not address the effects of disruptive people or disruptive posts... the fact that my choices mean that *I* don't see a particular poster or post doesn't change the fact that said poster / posts are still there, and still affecting the character of the online community.

And that, I think, points us to what is needed: A system which fosters a stimulating, engaging *community* which will in turn foster stimulating and engaging *discussion*.

To do that, we need a system which provides some substitute for the social checks and balances we have inherent in face to face, community interaction, the checks and balances we've evolved over millennia of social and cultural evolution. I think the work being done on reputation economies and trust economies points us in a productive direction.

Since I've already been long-winded enough, I'll stop here, but will be glad to continue this if there is interest.

Re: PyLucene 3.0 in 60 seconds — Tutorial sample code for the 3.0 API

Cerin — Wed, 06 Apr 2011 14:47:17 -0000

Thanks for the examples. Took me a while to figure out how to build PyLucene on Ubuntu, but after that these examples worked perfectly.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Thilagaranim — Fri, 11 Mar 2011 03:37:59 -0000

I need a database to find semantically related verbs.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Abiya Veni — Tue, 22 Feb 2011 22:54:58 -0000

I want information about how to identify semantically similar words.

Re: Code maintainability, and the joy of outsourcing

PMP Online — Sat, 22 Jan 2011 08:05:58 -0000

This early detection leads to a more streamlined and efficient software development process and ensures the output of clean, highly maintainable code.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

probreasoning — Tue, 18 Jan 2011 14:01:48 -0000

I used a very simple llr-type approach, described at https://probreasoning.wordp...
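
For readers unfamiliar with the term: "LLR" usually refers to the log-likelihood ratio (G^2) statistic computed over a 2x2 co-occurrence table, the style of test associated with Ted Dunning. The following is a generic sketch of that statistic, not necessarily the approach described in the linked post; the function and argument names are assumptions chosen for illustration.

```python
import math


def _xlogx(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)) over the counts.
    return _xlogx(sum(counts)) - sum(_xlogx(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts.

    k11: contexts containing both terms
    k12: contexts containing term A but not term B
    k21: contexts containing term B but not term A
    k22: contexts containing neither term
    Larger values mean the co-occurrence is less plausibly due to chance.
    """
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))
```

Scoring every candidate pair with `llr` and keeping the highest-scoring partners for each term gives one simple notion of "related" that is based on co-occurrence rather than on similarity of contexts.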

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Griffin — Sat, 08 Jan 2011 21:18:21 -0000

So, anything ever happen with this? I checked out both lists mentioned above, but found nothing. No new blog posts pertaining to this challenge either.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Guillaume Pitel — Wed, 22 Dec 2010 13:08:06 -0000

I found your challenge too late to participate, but I can still answer it:
1 - creating good features is probably the hardest part. Several possibilities are open for such a large-scale problem: local embedding with thematic dimensions (provided you can gather a few thematic collections), HOOI for the PCA, or you could use my (still undisclosed) method whose results are described here: http://blog.guillaume-pitel...
2 - finding the top-K nearest neighbours can be easily done (I think) with LSH or KD-trees

As for the speed of feature extraction, I did this experiment once: 500K words as vocabulary, 20 * 40M documents (they were word windows). Using my method, it took approx. 30 min on a quad core with a Geforce280 (I was also experimenting with GPGPU computing).
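
The nearest-neighbour step in point 2 can be sketched with random-hyperplane LSH for cosine similarity. This is a toy, pure-Python illustration under stated assumptions: it uses a single hash table, all function names are hypothetical, and it is not the undisclosed method mentioned above. A real system over a vocabulary of this size would likely use several tables and an optimized library.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    # One random Gaussian hyperplane per signature bit.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    # Each bit records which side of a hyperplane the vector falls on.
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def build_index(vectors, planes):
    # Group words by signature; similar directions tend to share a bucket.
    buckets = defaultdict(list)
    for word, vec in vectors.items():
        buckets[signature(vec, planes)].append(word)
    return buckets


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_word, vectors, planes, buckets, k=5):
    # Only rank candidates from the query's own bucket (approximate search).
    q = vectors[query_word]
    candidates = [w for w in buckets.get(signature(q, planes), []) if w != query_word]
    candidates.sort(key=lambda w: cosine(q, vectors[w]), reverse=True)
    return candidates[:k]
```

More signature bits make buckets smaller and lookups faster, at the cost of missing more true neighbours; that trade-off is the main tuning knob in this kind of scheme.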

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

anthony — Tue, 21 Dec 2010 16:39:17 -0000

Woo, very interesting. I never thought of this as a solvable problem, but I will look at it and give it a good go. http://tinyurl.com/stop-sno...

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

bista — Sun, 21 Nov 2010 16:50:50 -0000

"the" is a content word when it is missing. In other words, a text without "the" is not a typical document at all. The removal of so-called "non-content" words from all documents takes some good techniques off the table.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Joseph Turian — Thu, 11 Nov 2010 00:53:57 -0000

*The data isn't clean.*

Welcome to the real world.

*I'd love to throw out the numbers (at least the low ones) and the extra characters like dashes.*

You can do that if you like.

*Leaving the goal of "semantically defined" vague doesn't really give us much to shoot at.*

Part of this task is to see how people define the problem. Part of the exercise is learning through evaluation and looking at people's outputs. Yeah, it's less well-defined and less academic that way. I find that more interesting.

I think it will also be interesting to see if there is a mismatch between people's interpretations of what "semantically related" means and what methods produce a certain interpretation.

*Is the goal really to find semantic relationships based on YOUR data, or just to find semantic relationships based on web-mined data?*

The goal is to find semantic relationships over a particular vocabulary. You can use the data set that generated that vocabulary. And/or you can use auxiliary data.

But I apologize to the extent that you don't like the setup. This is my first time doing a challenge and I'm trying to learn for next time.

Re: Hacker News, automagically organized [unofficial] - MetaOptimize

natep — Tue, 09 Nov 2010 22:51:23 -0000

Would love to see this used to visualize what topics get comments like "this isn't HN" yet still get many upvotes.

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Mikes — Mon, 08 Nov 2010 23:06:44 -0000

I like this problem, but I think it would be better with the following adjustments:

1) The data isn't clean. I'd love to throw out the numbers (at least the low ones) and the extra characters like dashes. If the goal is to find semantic meaning, you're adding more noise than value by including them.

2) Leaving the goal of "semantically defined" vague doesn't really give us much to shoot at. I went for co-occurrence and produced reasonable results. Another person interpreted it as "similar" and produced something different. What's the goal here? Any good algorithm needs a spec.

3) Is the goal really to find semantic relationships based on YOUR data, or just to find semantic relationships based on web-mined data? The data that you zipped up isn't terrible, but it's surprisingly noisy. We can find better data on Twitter, Wikipedia, or on most other crawls that I've seen.

The goal here is good. The setup is not.

Re: Code maintainability, and the joy of outsourcing

KPO — Mon, 08 Nov 2010 22:26:21 -0000

I believe outsourcing is necessary in today’s business environment to save on cost. It has become necessary to cut corners everywhere to remain competitive. Outsourcing can also help decimate the costs that are incurred, making the process cost-effective. It also allows your company to focus on its core competencies and develop its in-house processes, which in turn reduces lead time and brings about celerity in the market.

Ray

Re: NLP Challenge: Find semantically related terms over a large vocabulary (>1M)?

Joseph Turian — Mon, 08 Nov 2010 20:10:58 -0000

I would be very excited if you tried an LLR approach. Ted Dunning was suggesting that style of approach to me too.