Question

I find myself having to learn new things all the time. I've been trying to think of ways to expedite the process of learning new subjects. I thought it might be neat if I could write a program to parse a Wikipedia article and remove everything but the most valuable information.

I started by taking the Wikipedia article on PDFs and extracting the first 100 sentences. I gave each sentence a score based on how valuable I thought it was. I ended up creating a file following this format:

<sentence>
<value>
<sentence>
<value>
etc.
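For what it's worth, a file in that alternating format can be parsed in a few lines of Python. The filename and the use of `float` scores here are assumptions; adjust to match whatever you actually wrote out:

```python
def parse_scores(path):
    """Return a list of (sentence, score) pairs from a file whose lines
    alternate between a sentence and its hand-assigned numeric value."""
    pairs = []
    with open(path) as f:
        # Drop blank lines so stray newlines don't shift the pairing.
        lines = [line.strip() for line in f if line.strip()]
    for sentence, value in zip(lines[::2], lines[1::2]):
        pairs.append((sentence, float(value)))
    return pairs
```

Keeping the parsing in one small function like this makes it easy to swap in a different file format later without touching the scoring code.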

I then parsed this file and attempted to find various functions that would correlate each sentence with the value I had given it. I've just begun learning about machine learning and statistics and whatnot, so I'm doing a lot of fumbling around here. This is my latest attempt: https://github.com/JesseAldridge/Wikipedia-Summarizer/blob/master/plot_sentences.py.

I tried a bunch of stuff that didn't seem to produce much of any correlation at all -- average word length, position in the article, etc. Pretty much the only thing that produced any sort of useful relationship was the length of the string (more specifically, counting the number of lowercase letter 'e's seemed to work best). But that seems kind of lame, because it seems obvious that longer sentences would be more likely to contain useful information.
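One thing that might tighten up the feature hunting is computing an actual correlation coefficient for each candidate feature, rather than eyeballing plots. A minimal sketch using plain Python (the sentences and scores below are made-up illustrations, not real data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.
    Returns a value in [-1, 1]; near 0 means no linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Example: correlate a feature (sentence length) with hand-assigned scores.
sentences = ["Short one.", "A much longer sentence with more detail in it.", "Mid length here."]
scores = [1.0, 4.0, 2.0]
print(pearson([len(s) for s in sentences], scores))
```

Running every candidate feature through the same `pearson` call gives you a single comparable number per feature, which makes "length works, average word length doesn't" a quantitative claim instead of a visual one.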

At one point I thought I had found some interesting functions, but when I tried removing outliers (by only counting the inner quartiles), they turned out to produce worse results than simply returning 0 for every sentence. This got me wondering how many other things I might be doing wrong... I'm also wondering whether this is even a good way to approach this problem.
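For reference, a crude inner-quartile filter might look like the sketch below. Note the percentile indices here are not interpolated (real implementations such as `numpy.percentile` are more careful), so it's only an approximation, especially for small samples:

```python
def inner_quartiles(values):
    """Keep only the values between the (approximate) 25th and 75th
    percentiles, preserving the original order of the input."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[n // 4]          # rough 25th percentile
    hi = ordered[(3 * n) // 4]    # rough 75th percentile
    return [v for v in values if lo <= v <= hi]
```

One caveat worth checking in your own code: if the filter is applied to the feature values but not to the matching scores (or vice versa), the two lists fall out of alignment, which would quietly wreck any correlation computed afterwards.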

Do you think I'm on the right track? Or is this just a fool's errand? Are there any glaring deficiencies in the linked code? Does anyone know of a better way to approach the problem of summarizing a Wikipedia article? I'd rather have a quick and dirty solution than something perfect that takes a long time to put together. Any general advice would also be welcome.


Solution

Considering that your question relates more to a research activity than to a programming problem, you should probably look at the scientific literature. There you will find published details of a number of algorithms that do exactly what you want. A Google search for "keyword summarization" finds the following:

Single document Summarization based on Clustering Coefficient and Transitivity Analysis

Multi-document Summarization for Query Answering E-learning System

Intelligent Email: Aiding Users with AI

If you read the above and then follow the references they contain, you will find a wealth of information; certainly enough to build a functional application.

OTHER TIPS

Just my two cents...

Whenever I'm browsing a new subject on Wikipedia, I typically perform a "breadth-first" search: I refuse to move on to another topic until I've scanned every link on the page that introduces a topic I'm not already familiar with. I read the first sentence of each paragraph, and if I see something in that article that appears to relate to the original topic, I repeat the process.

If I were to design the interface for a Wikipedia "summarizer", I would

  1. Always print the entire introductory paragraph.

  2. For the rest of the article, print any sentence that has a link in it.

    2a. Print any comma separated lists of links as a bullet pointed list.

  3. If the link to the article is "expanded", print the first paragraph for that article.

  4. If that introductory paragraph is expanded, repeat the listing of sentences with links.

This process could repeat indefinitely.
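The steps above could be sketched roughly as follows. The toy `WIKI` dict and the `[[Title]]` link markup are stand-ins for whatever real Wikipedia parser you'd use, and for brevity this omits step 2a (bulleting comma-separated link lists) and the indefinite recursion:

```python
import re

# Toy stand-in for parsed Wikipedia data: each article has an intro
# paragraph and a list of body sentences; links are written as [[Title]].
WIKI = {
    "PDF": {
        "intro": "PDF is a document format developed by [[Adobe]].",
        "body": [
            "It is standardized as [[ISO 32000]].",
            "Many viewers exist.",
        ],
    },
    "Adobe": {"intro": "Adobe is a software company.", "body": []},
    "ISO 32000": {"intro": "ISO 32000 is the PDF standard.", "body": []},
}

LINK = re.compile(r"\[\[([^]]+)\]\]")

def summarize(title, expand=(), wiki=WIKI):
    """Steps 1-4 above: keep the intro (1), then every body sentence that
    contains a link (2); for link targets named in `expand`, append that
    article's intro paragraph (3)."""
    lines = [wiki[title]["intro"]]
    for sentence in wiki[title]["body"]:
        if LINK.search(sentence):
            lines.append(sentence)
    for target in LINK.findall("\n".join(lines)):
        if target in expand and target in wiki:
            lines.append(wiki[target]["intro"])
    return lines
```

For example, `summarize("PDF")` keeps the intro and the link-bearing sentence but drops "Many viewers exist.", and `summarize("PDF", expand=("Adobe",))` additionally pulls in the Adobe intro, which is where step 4's repetition would kick in.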

What I'm saying is that summarizing a Wikipedia article isn't the same as summarizing an article from a magazine or a posting on a blog. The act of crawling is an important part of learning introductory concepts quickly via Wikipedia, and I feel it's for the best. Typically, the bottom half of an article is where the "citation needed" tags start popping up, but the first half of any given article is treated as established knowledge by the community.

Licensed under: CC-BY-SA with attribution