Question

The SourceForge Research Data Archive (SRDA) is one of the data sources for my dissertation research. I'm having difficulty debugging the following issue with SRDA data collection.

Collecting data from SRDA requires authenticating and then submitting a Web form with an SQL query. After successfully processing the query, the system generates a text file with the query results. While testing my R code for SRDA data collection, I changed the SQL request to make sure that the results file is regenerated. However, I discovered that the file contents stay the same (they correspond to the previous query). I think the file is not refreshed because either authentication or the query form submission fails. The following is the debug output from the code (https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R):

make importSourceForge

Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
  '/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'

Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest

Retrieving SourceForge data...

Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
*   Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded

* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1

I've tried to figure this out using the debug output as well as the network protocol analyzer in Firefox's built-in Developer Tools, but so far without much success. I would appreciate any advice and help.

UPDATE:

if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')

library(RCurl)
library(digest)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the text file with the query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file

curl <- getCurlHandle()

srdaLogin <- function (loginURL, username, password) {

  curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
             ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
             followlocation = TRUE, verbose = TRUE)

  params <- list('wpName1' = username, 'wpPassword1' = password)

  if(url.exists(loginURL)) {
    reply <- postForm(loginURL, .params = params, curl = curl,
                      style = "POST")
    #if (DEBUG) print(reply)
    info <- getCurlInfo(curl)
    return (ifelse(info$response.code == 200, TRUE, FALSE))
  }
  else {
    stop("Can't access login URL!")
  }
}


srdaConvertRequest <- function (request) {

  return (list(select = "*",
               from = "sf1104.users a, sf1104.artifact b",
               where = "b.artifact_id = 304727"))
}


srdaRequestData <- function (requestURL, select, from, where, sep, sql) {

  params <- list('uitems' = select,
                 'utables' = from,
                 'uwhere' = where,
                 'useparator' = sep,
                 'append_query' = sql)

  if(url.exists(requestURL)) {
    reply <- postForm(requestURL, .params = params, #.opts = opts,
                      curl = curl, style = "POST")
  }
}


srdaGetData <- function(request) {

  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")

  results.query <- readLines(resultsURL, n = 1)

  return (ifelse(results.query == request, TRUE, FALSE))
}


getSourceForgeData <- function (request) {

  # Construct SRDA login and query URLs
  loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                    collapse="", sep="")
  queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

  # Log into the system 
  if (!srdaLogin(loginURL, USER, PASS))
    stop("Login failed!")

  rq <- srdaConvertRequest(request)

  srdaRequestData(queryURL,
                  rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)

  if (!srdaGetData(request))
    stop("Data collection failed!")
}


message("\nTesting SourceForge data collection...\n")

getSourceForgeData("SELECT * 
FROM sf1104.users a, sf1104.artifact b 
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")

# clean up
close(curl)

UPDATE 2 (no functions version):

if (!require(RCurl)) install.packages('RCurl')
library(RCurl)

# Users must authenticate to access Query Form
SRDA_HOST_URL  <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the text file with the query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL  <- "1" # add SQL to file


message("\nTesting SourceForge data collection...\n")

curl <- getCurlHandle()

curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
           ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
           followlocation = TRUE, verbose = TRUE)

# === Authentication ===

loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)

loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                  collapse="", sep="")

if (url.exists(loginURL)) {
  postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
  info <- getCurlInfo(curl)
  message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
  stop("\nCan't access login URL!\n\n")
}

# === Data collection ===

# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
              from = "sf1104.users a, sf1104.artifact b",
              where = "b.artifact_id = 304727") 

getDataParams <- list('uitems'       = query$select,
                      'utables'      = query$from,
                      'uwhere'       = query$where,
                      'useparator'   = DATA_SEP,
                      'append_query' = ADD_SQL)

queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")

if(url.exists(queryURL)) {
  postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")
  results.query <- readLines(resultsURL, n = 1)
  request <- paste(query$select, query$from, query$where)
  if (results.query == request)
    message("\nData request is successful, SQL query: ", request, "\n\n")
  else
    message("\nData request failed, SQL query: ", request, "\n\n")
} else {
  stop("\nCan't access data query URL!\n\n")
}

close(curl)

UPDATE 3 (server-side debugging)

Finally, I was able to get in touch with the person responsible for the system, and he helped me narrow the issue down to what I believe is cookie management. Here's the error log record corresponding to a run of my code:

[Fri Mar 21 15:33:14 2014] [error] [client 54.204.180.203] [Fri Mar 21 15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac: at /var/www/cgi-bin/form.pl line 43

The following is the snippet of the server-side Perl script that generated that error (line #1 in the script is the interpreter directive, so the reported line number 43 is most likely line number 44 below):

42     if (-e "/tmp/sess_$file") {
43     $session = PHP::Session->new($cgi->cookie("$session_name"));
44     $user_id = $session->get('wsUserID');
45     $user_name = $session->get('wsUserName');

The following is the session information (1) after authentication and (2) after submitting the data request, obtained by tracing a manual authentication and a manual submission of the data request form:

(1) "wiki_dbUserID=449; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/wiki_dbUserName=Blekh; expires=Sun, 20-Apr-2014 21:04:14 GMT; path=/wiki_dbToken=deleted; expires=Thu, 21-Mar-2013 21:04:13 GMT"

(2) wiki_db_session=aaed058f97059174a59effe44b137cbc; _ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449
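For comparison, the automated run in the debug output above sends only the wiki_db_session cookie with the query request. A minimal base R sketch of that comparison follows. The list of "required" cookies is an assumption based on the manual trace (2), not something the server documentation confirms, and parseCookies and missingFrom are hypothetical helper names.

```r
# Split a "Cookie:" header value into a named character vector
parseCookies <- function(header) {
  pairs <- strsplit(header, ";[[:space:]]*")[[1]]
  vals <- sub("^[^=]*=", "", pairs)     # text after the first "="
  names(vals) <- sub("=.*$", "", pairs) # text before the first "="
  vals
}

# ASSUMPTION: cookies a logged-in session carries, taken from trace (2)
required <- c("wiki_db_session", "wiki_dbUserID", "wiki_dbUserName")

# Report which of the required cookies a header value lacks
missingFrom <- function(header) {
  setdiff(required, names(parseCookies(header)))
}

# Cookie header from the manual (browser) session, trace (2)
browserCookies <- paste0(
  "wiki_db_session=aaed058f97059174a59effe44b137cbc; ",
  "_ga=GA1.2.2065853334.1395410153; ",
  "EDSSID=e24ff5ed891c28c61f2d1f8dec424274; ",
  "wiki_dbUserName=Blekh; wiki_dbLoggedOut=20140321210314; ",
  "wiki_dbUserID=449")

# Cookie header sent by the script (value elided as in the debug output)
scriptCookies <- "wiki_db_session=c61...a3c"

print(missingFrom(browserCookies))
print(missingFrom(scriptCookies))
```

If the assumption holds, the script's request lacks the two user cookies, which would suggest that the login POST did not actually log in even though it returned HTTP status 200.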

Would appreciate any help in figuring out the problem with my code!


Solution 2

I've simplified the code still further:

library(httr)

base_url  <- "http://srda.cse.nd.edu"

loginURL <- modify_url(
  base_url, 
  path = "mediawiki/index.php", 
  query = list(
    title = "Special:Userlogin", 
    action = "submitlogin",
    type = "login",
    wpName1 = USER,
    wpPassword1 = PASS
  )
)
r <- POST(loginURL)
stop_for_status(r)

queryURL <- modify_url(base_url, path = "cgi-bin/form.pl")
query <- list(
  uitems       = "user_name",
  utables      = "sf1104.users a, sf1104.artifact b",
  uwhere       = "a.user_id = b.submitted_by AND b.artifact_id = 304727",
  useparator   = ":",
  append_query = "1"
)
r <- POST(queryURL, body = query, multipart = FALSE)
stop_for_status(r)

But I'm still getting a 500. I tried:

  • setting extra cookies that I see in the browser (wiki_dbUserID, wiki_dbUserName)
  • setting header DNT to 1
  • setting referer to http://srda.cse.nd.edu/cgi-bin/form.pl
  • setting user-agent the same as chrome
  • setting accept "text/html"

OTHER TIPS

Finally, finally, finally! I have figured out what was causing this problem, which gave me so much headache (figuratively and literally). It forced me to spend a lot of time reading various Internet resources (including many SO questions and answers), debugging my code and communicating with people. The time was not spent in vain, though, as I learned a lot about RCurl, cookies, Web forms and the HTTP protocol.

The reason turned out to be much simpler than I thought. While the direct cause of the form submission failure was related to cookie management, the underlying cause was using the wrong parameter names (IDs) for the authentication form fields. The two sets of names were very similar, and it took only one extra character to trigger the whole problem.

Lesson learned: when facing issues, especially ones dealing with authentication, it's very important to check all names and IDs multiple times and very carefully, to make sure they correspond to the ones that are supposed to be used. Thank you to everyone who helped or tried to help me with this issue!
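One way to make that check mechanical is to compare the parameter names you send with the name attributes in the form's HTML. The following is a minimal base R sketch. The form markup is a hypothetical excerpt, written on the assumption that the login form gives each field a name attribute that differs from its id by one trailing character; verify the real page source before relying on it. inputAttr is a hypothetical helper name.

```r
# HYPOTHETICAL excerpt of a login form; the real markup may differ
formHtml <- paste(
  '<form name="userlogin" method="post">',
  '<input type="text" name="wpName" id="wpName1" />',
  '<input type="password" name="wpPassword" id="wpPassword1" />',
  '</form>')

# Extract the value of one attribute from every <input> tag
inputAttr <- function(html, attr) {
  tags <- regmatches(html, gregexpr("<input[^>]*>", html))[[1]]
  pattern <- paste0("\\b", attr, '="([^"]*)"')
  hits <- regmatches(tags, regexpr(pattern, tags, perl = TRUE))
  sub(pattern, "\\1", hits, perl = TRUE)
}

fieldNames <- inputAttr(formHtml, "name") # what the server reads
fieldIds   <- inputAttr(formHtml, "id")   # used by scripts and styles only

# Parameters as sent by the code in the question
params <- list(wpName1 = "user", wpPassword1 = "secret")

# Any parameter that is not a field name is ignored by the form handler
unknown <- setdiff(names(params), fieldNames)
if (length(unknown) > 0)
  message("Not form field names: ", paste(unknown, collapse = ", "))
```

A form submission is keyed by the name attributes, so parameters that match only the id values are silently dropped and the login can fail while still returning HTTP status 200.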

The following provides clarification for the scenario (error situation).

From RFC 2616 (HTTP/1.1 specification):

10.5 Server Error 5xx

Response status codes beginning with the digit "5" indicate cases in which the server is aware that it has erred or is incapable of performing the request. Except when responding to a HEAD request, the server SHOULD include an entity containing an explanation of the error situation, and whether it is a temporary or permanent condition. User agents SHOULD display any included entity to the user. These response codes are applicable to any request method.

10.5.1 500 Internal Server Error

The server encountered an unexpected condition which prevented it from fulfilling the request.

My interpretation of paragraph 10.5 is that the response should carry a more detailed explanation of the error situation than the one provided in paragraph 10.5.1. However, I recognize that the message for status code 500 (paragraph 10.5.1) may well be considered sufficient. Confirmation of either interpretation is welcome!

Licensed under: CC-BY-SA with attribution