The SourceForge Research Data Archive (SRDA) is one of the data sources for my dissertation research. I'm having difficulty debugging the following issue related to SRDA data collection.
Data collection from SRDA requires authenticating and then submitting a Web form with an SQL query. Upon successful processing of the query, the system generates a text file with the query results. While testing my R code for SRDA data collection, I changed the SQL request to make sure that the results file is being regenerated. However, I discovered that the file contents stay the same (they correspond to the previous query). I think that the lack of refresh could be due to a failure of either the authentication or the query form submission. The following is the debug output from the code (https://github.com/abnova/diss-floss/blob/master/import/getSourceForgeData.R):
make importSourceForge
Rscript --no-save --no-restore --verbose getSourceForgeData.R
running
'/usr/lib/R/bin/R --slave --no-restore --no-save --no-restore --file=getSourceForgeData.R'
Loading required package: RCurl
Loading required package: methods
Loading required package: bitops
Loading required package: digest
Retrieving SourceForge data...
Checking request "SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727"...
* About to connect() to zerlot.cse.nd.edu port 80 (#0)
* Trying 129.74.152.47... * connected
> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Content-Length: 37
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 37out of 37 bytes
< HTTP/1.1 200 OK
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< X-Powered-By: PHP/5.2.4-2ubuntu5.25
* Added cookie wiki_db_session="c61...a3c" for domain zerlot.cse.nd.edu, path /, expire 0
< Set-Cookie: wiki_db_session=c61...a3c; path=/
< Content-language: en
< Vary: Accept-Encoding,Cookie
< Expires: Thu, 01 Jan 1970 00:00:00 GMT
< Cache-Control: private, must-revalidate, max-age=0
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
* Connection #0 to host zerlot.cse.nd.edu left intact
[1] "Before second postForm()"
* Re-using existing connection! (#0) with host zerlot.cse.nd.edu
* Connected to zerlot.cse.nd.edu (129.74.152.47) port 80 (#0)
> POST /cgi-bin/form.pl HTTP/1.1
Host: zerlot.cse.nd.edu
Accept: */*
Cookie: wiki_db_session=c61...a3c
Content-Length: 129
Content-Type: application/x-www-form-urlencoded
* upload completely sent off: 129out of 129 bytes
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 11 Mar 2014 03:49:04 GMT
< Server: Apache/2.2.8 (Ubuntu) PHP/5.2.4-2ubuntu5.25 with Suhosin-Patch
< Vary: Accept-Encoding
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html
<
* Closing connection #0
Error: Internal Server Error
Execution halted
make: *** [importSourceForge] Error 1
I've tried to figure this out using the debug output as well as the network analyzer in Firefox's built-in Developer Tools, but so far without much success. I would appreciate any advice and help.
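To make it easier to see which of the two steps fails, here is a minimal sketch that pulls the request and status lines out of the verbose output above. The helper name `summarizeLog` is hypothetical (it is not part of my script), and it assumes that every POST line is followed by exactly one status line. Note that a 200 status on the login request may not by itself prove that the login succeeded.

```r
# Request and status lines copied from the verbose output above.
logLines <- c(
  "> POST /mediawiki/index.php?title=Special:Userlogin&action=submitlogin&type=login HTTP/1.1",
  "< HTTP/1.1 200 OK",
  "> POST /cgi-bin/form.pl HTTP/1.1",
  "< HTTP/1.1 500 Internal Server Error"
)

# Hypothetical helper: pair each POST request path with the status code
# of the response that follows it (assumes one response per request).
summarizeLog <- function(lines) {
  requestLines <- grep("^> POST ", lines, value = TRUE)
  statusLines <- grep("^< HTTP/", lines, value = TRUE)
  requests <- sub("^> POST ([^ ]+) HTTP/.*$", "\\1", requestLines)
  statuses <- as.integer(sub("^< HTTP/[^ ]+ ([0-9]{3}).*$", "\\1", statusLines))
  data.frame(request = requests, status = statuses, stringsAsFactors = FALSE)
}

logSummary <- summarizeLog(logLines)
print(logSummary)
```

In this run the login POST is answered with 200, while the POST to /cgi-bin/form.pl is answered with 500.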
UPDATE:
if (!require(RCurl)) install.packages('RCurl')
if (!require(digest)) install.packages('digest')
library(RCurl)
library(digest)

# USER and PASS (SRDA credentials) must be defined before running

# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the file with query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file

# Single curl handle shared by all requests, so that cookies persist
curl <- getCurlHandle()

# Authenticate; returns TRUE if the login request returned HTTP 200
srdaLogin <- function (loginURL, username, password) {
  curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
             ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
             followlocation = TRUE, verbose = TRUE)
  params <- list('wpName1' = username, 'wpPassword1' = password)
  if (url.exists(loginURL)) {
    reply <- postForm(loginURL, .params = params, curl = curl,
                      style = "POST")
    #if (DEBUG) print(reply)
    info <- getCurlInfo(curl)
    return (info$response.code == 200)
  }
  else {
    stop("Can't access login URL!")
  }
}

# Convert SQL request into form fields (hard-coded for testing)
srdaConvertRequest <- function (request) {
  return (list(select = "*",
               from = "sf1104.users a, sf1104.artifact b",
               where = "b.artifact_id = 304727"))
}

# Submit the query form
srdaRequestData <- function (requestURL, select, from, where, sep, sql) {
  params <- list('uitems' = select,
                 'utables' = from,
                 'uwhere' = where,
                 'useparator' = sep,
                 'append_query' = sql)
  if (url.exists(requestURL)) {
    reply <- postForm(requestURL, .params = params, #.opts = opts,
                      curl = curl, style = "POST")
  }
}

# Check that the first line of the results file matches the request
srdaGetData <- function(request) {
  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")
  results.query <- readLines(resultsURL, n = 1)
  return (results.query == request)
}

getSourceForgeData <- function (request) {
  # Construct SRDA login and query URLs
  loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                    collapse="", sep="")
  queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
  # Log into the system
  if (!srdaLogin(loginURL, USER, PASS))
    stop("Login failed!")
  rq <- srdaConvertRequest(request)
  srdaRequestData(queryURL,
                  rq$select, rq$from, rq$where, DATA_SEP, ADD_SQL)
  if (!srdaGetData(request))
    stop("Data collection failed!")
}

message("\nTesting SourceForge data collection...\n")
getSourceForgeData("SELECT *
FROM sf1104.users a, sf1104.artifact b
WHERE a.user_id = b.submitted_by AND b.artifact_id = 304727")

# clean up
close(curl)
UPDATE 2 (no functions version):
if (!require(RCurl)) install.packages('RCurl')
library(RCurl)

# USER and PASS (SRDA credentials) must be defined before running

# Users must authenticate to access Query Form
SRDA_HOST_URL <- "http://zerlot.cse.nd.edu"
SRDA_LOGIN_URL <- "/mediawiki/index.php?title=Special:Userlogin"
SRDA_LOGIN_REQ <- "&action=submitlogin&type=login"

# SRDA URL that Query Form sends POST requests to
SRDA_QUERY_URL <- "/cgi-bin/form.pl"

# SRDA URL of the file with query results
SRDA_QRESULT_URL <- "/qresult/blekh/blekh.txt"

# Parameters for result's format
DATA_SEP <- ":" # data separator
ADD_SQL <- "1" # add SQL to file

message("\nTesting SourceForge data collection...\n")

curl <- getCurlHandle()
curlSetOpt(curl = curl, cookiejar = 'cookies.txt',
           ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
           followlocation = TRUE, verbose = TRUE)

# === Authentication ===
loginParams <- list('wpName1' = USER, 'wpPassword1' = PASS)
loginURL <- paste(SRDA_HOST_URL, SRDA_LOGIN_URL, SRDA_LOGIN_REQ,
                  collapse="", sep="")
if (url.exists(loginURL)) {
  postForm(loginURL, .params = loginParams, curl = curl, style = "POST")
  info <- getCurlInfo(curl)
  message("\nLogin results - HTTP status code: ", info$response.code, "\n\n")
} else {
  stop("\nCan't access login URL!\n\n")
}

# === Data collection ===
# Previous query was: "SELECT * FROM sf0305.users WHERE user_id < 100"
query <- list(select = "*",
              from = "sf1104.users a, sf1104.artifact b",
              where = "b.artifact_id = 304727")
getDataParams <- list('uitems' = query$select,
                      'utables' = query$from,
                      'uwhere' = query$where,
                      'useparator' = DATA_SEP,
                      'append_query' = ADD_SQL)
queryURL <- paste(SRDA_HOST_URL, SRDA_QUERY_URL, collapse="", sep="")
if (url.exists(queryURL)) {
  postForm(queryURL, .params = getDataParams, curl = curl, style = "POST")
  resultsURL <- paste(SRDA_HOST_URL, SRDA_QRESULT_URL,
                      collapse="", sep="")
  # First line of the results file should contain the submitted query
  results.query <- readLines(resultsURL, n = 1)
  request <- paste(query$select, query$from, query$where)
  if (results.query == request)
    message("\nData request is successful, SQL query: ", request, "\n\n")
  else
    message("\nData request failed, SQL query: ", request, "\n\n")
} else {
  stop("\nCan't access data query URL!\n\n")
}

close(curl)
UPDATE 3 (server-side debugging)
Finally, I was able to get in touch with the person responsible for the system, and he helped me narrow the issue down to what I believe is cookie management. Here's the error log record that corresponds to running my code:
[Fri Mar 21 15:33:14 2014] [error] [client 54.204.180.203] [Fri Mar 21 15:33:14 2014] form.pl: /tmp/sess_3e55593e436a013597cd320e4c6a2fac: at /var/www/cgi-bin/form.pl line 43
The following is the snippet of the server-side Perl script that generated that error (line #1 in the script is the interpreter directive, so the reported line number 43 is most likely line number 44 in the snippet):
42 if (-e "/tmp/sess_$file") {
43 $session = PHP::Session->new($cgi->cookie("$session_name"));
44 $user_id = $session->get('wsUserID');
45 $user_name = $session->get('wsUserName');
The following is the session (cookie) information (1) after authentication and (2) after submitting the data request, obtained by tracing manual authentication and manual submission of the data request form:
(1) "wiki_dbUserID=449; expires=Sun, 20-Apr-2014 21:04:14 GMT;
path=/wiki_dbUserName=Blekh; expires=Sun, 20-Apr-2014 21:04:14 GMT;
path=/wiki_dbToken=deleted; expires=Thu, 21-Mar-2013 21:04:13 GMT"
(2) wiki_db_session=aaed058f97059174a59effe44b137cbc;
_ga=GA1.2.2065853334.1395410153; EDSSID=e24ff5ed891c28c61f2d1f8dec424274; wiki_dbUserName=Blekh;
wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449
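To make the difference in cookies concrete, here is a minimal sketch that compares the cookie names from the manual session in (2) with the single cookie my script sends to /cgi-bin/form.pl, as shown in the verbose output. The helper name `parseCookieHeader` is hypothetical, the two captures come from different runs, and the script's cookie value is elided in the debug output, so it stays elided here. I don't know which of these cookies form.pl actually needs, so this only shows what is missing, not what the fix is.

```r
# Cookies sent during the manual session, copied from (2) above.
manualCookies <- paste0(
  "wiki_db_session=aaed058f97059174a59effe44b137cbc; ",
  "_ga=GA1.2.2065853334.1395410153; ",
  "EDSSID=e24ff5ed891c28c61f2d1f8dec424274; ",
  "wiki_dbUserName=Blekh; ",
  "wiki_dbLoggedOut=20140321210314; wiki_dbUserID=449")

# Cookie sent by the R script, copied from the verbose output
# (the value is elided there and is kept elided here).
scriptCookies <- "wiki_db_session=c61...a3c"

# Hypothetical helper: split a Cookie header value into a named
# character vector (cookie name -> cookie value).
parseCookieHeader <- function(header) {
  pairs <- strsplit(header, ";[[:space:]]*")[[1]]
  pairs <- pairs[nzchar(pairs)]
  cookieNames <- sub("=.*$", "", pairs)
  cookieValues <- sub("^[^=]*=", "", pairs)
  setNames(cookieValues, cookieNames)
}

# Cookie names present in the manual session but absent from the script's request
missingCookies <- setdiff(names(parseCookieHeader(manualCookies)),
                          names(parseCookieHeader(scriptCookies)))
print(missingCookies)
```

The script's request carries only wiki_db_session, while the manual session also carries wiki_dbUserName and wiki_dbUserID, among others.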
Would appreciate any help in figuring out the problem with my code!