Question

What logic does Wikipedia use to generate the full URL for images in articles, given the [[File:...]] tag in the MediaWiki markup, or the infobox |image=... line?

The URL seems to always start with http://upload.wikimedia.org/wikipedia/commons/, followed by two URL segments that do not seem to be predictable, plus the image name, which can be predicted from the tag.

For example,

[[File:Michael Jordan UNC Jersey cropped.jpg|thumb|left|Michael Jordan's jersey in the rafters of The [[Dean Smith Center]]]]

yields:

http://upload.wikimedia.org/wikipedia/commons/thumb/6/6a/Michael_Jordan_UNC_Jersey_cropped.jpg/220px-Michael_Jordan_UNC_Jersey_cropped.jpg

Is there any way to programmatically determine the /6/6a part of the URL, or is this a lookup on a Wikipedia server?

Another example, in the Infobox:

|image = Jordan Lipofsky.jpg

yields:

http://upload.wikimedia.org/wikipedia/commons/b/b3/Jordan_Lipofsky.jpg

Can the /b/b3 portion of the URL be predicted?

Thanks!

Solution

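It turns out the two segments are derived from an MD5 hash of the file name (with spaces replaced by underscores): the first segment is the first hex digit of the hash, and the second is the first two hex digits. Something like the Scala below will work, although I'm not sure how to predict whether the file lives under /commons or under /en.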

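import org.apache.commons.codec.digest.DigestUtils

// Builds the URL of the original (full-size) file.
// rootUrl must end with a slash, e.g. "http://upload.wikimedia.org/wikipedia/commons/".
def getImageUrl(fileName: String, rootUrl: String): String = {

    // The hash is computed over the name with underscores instead of spaces,
    // and the same underscored name appears at the end of the URL.
    val name = fileName.replace(" ", "_")
    val md5 = DigestUtils.md5Hex(name)

    val hash1 = md5.substring(0, 1) // first hex digit, e.g. "6"
    val hash2 = md5.substring(0, 2) // first two hex digits, e.g. "6a"

    rootUrl + hash1 + "/" + hash2 + "/" + name

}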
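
The thumbnail URL from the question follows the same pattern, with an extra thumb/ segment and a trailing <width>px-<name> part. Below is a minimal sketch that covers both forms using only the JDK's MessageDigest. The object and method names are made up for illustration, and upper-casing the first letter of the name is an assumption about how the wiki normalizes titles. Percent-encoding of non-ASCII or special characters in the file name is not handled.

```scala
import java.security.MessageDigest

// Hypothetical helper, not part of any library.
object WikimediaImageUrl {

  // Assumption: titles use underscores for spaces and an upper-case first letter.
  def normalize(fileName: String): String = {
    val n = fileName.trim.replace(' ', '_')
    if (n.isEmpty) n else n.substring(0, 1).toUpperCase + n.substring(1)
  }

  // Hex MD5 with two digits per byte, so leading zeros are preserved.
  def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes("UTF-8"))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // "6/6a"-style segment: first hex digit, then first two hex digits.
  def hashPath(fileName: String): String = {
    val md5 = md5Hex(normalize(fileName))
    md5.substring(0, 1) + "/" + md5.substring(0, 2)
  }

  // rootUrl is expected to end with a slash.
  def originalUrl(fileName: String, rootUrl: String): String =
    rootUrl + hashPath(fileName) + "/" + normalize(fileName)

  def thumbUrl(fileName: String, rootUrl: String, widthPx: Int): String = {
    val name = normalize(fileName)
    rootUrl + "thumb/" + hashPath(fileName) + "/" + name + "/" + widthPx + "px-" + name
  }
}
```

With rootUrl set to http://upload.wikimedia.org/wikipedia/commons/, calling thumbUrl with "Michael Jordan UNC Jersey cropped.jpg" and a width of 220 should reproduce the thumbnail URL from the question, and originalUrl with "Jordan Lipofsky.jpg" should reproduce the infobox one.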

Be careful to preserve leading zeros in the hex digest, since the URL segments are taken from its first characters, as discussed here:

Does wikipedia use different methods to compute the hash part of an image path?

http://lists.wikimedia.org/pipermail/mediawiki-api/2011-December/thread.html#2446
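
To illustrate the leading-zero pitfall: one common way to lose zeros is converting the digest through BigInteger, which drops them. The sketch below contrasts that with a per-byte conversion. The object and method names are hypothetical, and the input "a" is used only because its MD5 happens to start with a zero.

```scala
import java.math.BigInteger
import java.security.MessageDigest

// Hypothetical demo object, not part of any library.
object LeadingZeroDemo {

  def md5Bytes(s: String): Array[Byte] =
    MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))

  // Lossy: BigInteger.toString(16) omits leading zero digits.
  def md5HexLossy(s: String): String =
    new BigInteger(1, md5Bytes(s)).toString(16)

  // Safe: always two hex digits per byte, 32 characters in total.
  def md5HexPadded(s: String): String =
    md5Bytes(s).map(b => "%02x".format(b & 0xff)).mkString
}
```

For the input "a", the padded version yields 0cc175b9c0f1b6a831c399e269772661, so the path segments would be 0/0c. The lossy version yields a 31-character string starting with cc, which would give the wrong segments c/cc.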

Licensed under: CC-BY-SA with attribution