Question

Is it possible to get the publication date of CRAN packages from within R? I would like to get a list of the k most recently published CRAN packages, or alternatively all packages published after date dd-mm-yy, similar to the information on the available_packages_by_date.html page.

The available.packages() command has a "fields" argument, but this only extracts fields from the DESCRIPTION. The Date field in the DESCRIPTION is not always up to date.
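For illustration, something like the following pulls that field (a quick sketch that mostly shows the problem, since the Date field is optional and often stale):

ap <- available.packages(fields = "Date")
head(ap[, c("Package", "Version", "Date")])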

I can get it with a smart regex from the HTML page, but I am not sure how reliable and up-to-date that HTML file is... At some point Kurt might decide to give the layout a makeover, which would break the script. An alternative is to use timestamps from the CRAN FTP, but I am not sure how good that solution is either. Is there a formally structured file with publication dates somewhere? I assume the HTML page is automatically generated from some database.

Solution

Turns out there is an undocumented file "packages.rds" which contains the publication dates (not times) of all packages. I suppose these data are used to regenerate the HTML file every day.

Below is a simple function that extracts the publication dates from this file:

recent.packages.rds <- function(){
    mytemp <- tempfile();
    download.file("http://cran.r-project.org/web/packages/packages.rds", mytemp);
    mydata <- as.data.frame(readRDS(mytemp), row.names=NA);
    mydata$Published <- as.Date(mydata[["Published"]]);

    #sort by publication date and keep the fields you like:
    mydata <- mydata[order(mydata$Published),c("Package", "Version", "Published")];
    return(mydata);
}
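The most recent packages then sit at the bottom of the returned data frame, and a date cutoff is a simple subset (the cutoff date here is just an illustration):

pkgs <- recent.packages.rds()

#the 10 most recently published packages
tail(pkgs, 10)

#all packages published after a given date
subset(pkgs, Published > as.Date("2012-06-01"))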

OTHER TIPS

The best approach is to take advantage of the fact that the package DESCRIPTION is published on the CRAN mirror, and since the DESCRIPTION comes from the built package, it contains information about exactly when it was packaged:

#package names are in the first column of available.packages()
pkgs <- unname(available.packages()[, 1])[1:20]
desc_urls <- paste("http://cran.r-project.org/web/packages/", pkgs, "/DESCRIPTION", sep = "")
desc <- lapply(desc_urls, function(x) read.dcf(url(x)))

#when each package was built, and when it appeared on CRAN
sapply(desc, function(x) x[, "Packaged"])
sapply(desc, function(x) x[, "Date/Publication"])

(I'm restricting it to the first 20 packages here to illustrate the basic idea)
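A rough sketch of how this could be turned into the list the question asks for, by collecting the Date/Publication field into a data frame and sorting on it (as.Date simply drops the time part of that field):

pubdates <- data.frame(
    Package   = sapply(desc, function(x) x[, "Package"]),
    Published = as.Date(sapply(desc, function(x) x[, "Date/Publication"]))
)
pubdates <- pubdates[order(pubdates$Published, decreasing = TRUE), ]

#the most recently published packages of this subset
head(pubdates, 10)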

Here is a function that uses the HTML page and regular expressions. I would still rather get the information from a more formal source, though, in case the HTML layout ever changes.

recent.packages <- function(number=10){

    #html is malformed
    maxlines <- number*2 + 11
    mytemp <- tempfile()
    #use the configured CRAN mirror if there is one, otherwise fall back to the main CRAN site
    repo <- getOption("repos")["CRAN"];
    if(is.na(repo) || repo == "@CRAN@"){
        repo <- "http://cran.r-project.org";
    }
    newurl <- paste(repo,"/web/packages/available_packages_by_date.html", sep="");
    download.file(newurl, mytemp);
    datastring <- readLines(mytemp, n=maxlines)[12:maxlines];

    #we only find packages from after 2010-01-01
    myexpr1 <- '201[0-9]-[0-9]{2}-[0-9]{2} </td> <td> <a href="../../web/packages/[a-zA-Z0-9\\.]{2,}/'
    myexpr2 <- '^201[0-9]-[0-9]{2}-[0-9]{2}'
    myexpr3 <- '[a-zA-Z0-9\\.]{2,}/$'
    newpackages <- unlist(regmatches(datastring, gregexpr(myexpr1, datastring)));
    newdates <- unlist(regmatches(newpackages, gregexpr(myexpr2, newpackages)));
    newnames <- unlist(regmatches(newpackages, gregexpr(myexpr3, newpackages)));

    newdates <- as.Date(newdates);
    newnames <- substring(newnames, 1, nchar(newnames)-1);
    returndata <- data.frame(name=newnames, date=newdates);
    return(head(returndata, number));
}
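For example, recent.packages(25) should return a data frame with the names and publication dates of the 25 most recently listed packages, provided the page layout has not changed.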

So here is a solution that uses the directory listing from the FTP server. It is a little tricky because the FTP listing gives the date in the usual Linux format, with either a timestamp or a year. Other than that it does its job. I'm still not convinced this is reliable though: if packages are ever copied over to another server, all timestamps might be reset.

recent.packages.ftp <- function(){
    oldwd <- setwd(tempdir());
    on.exit(setwd(oldwd));
    download.file("ftp://cran.r-project.org/pub/R/src/contrib/", destfile=tempfile(), method="wget", extra="--no-htmlify");

    #because of --no-htmlify the destfile argument does not work; wget writes a .listing file instead
    datastring <- readLines(".listing");
    unlink(".listing");

    myexpr1 <- "(?<date>[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}) (?<name>[a-zA-Z0-9\\.]{2,})_(?<version>[0-9\\.-]*).tar.gz$"
    matches <- gregexpr(myexpr1, datastring, perl=TRUE);
    packagelines <- as.logical(sapply(regmatches(datastring, matches), length));

    #subset proper lines
    matches <- matches[packagelines];
    datastring <- datastring[packagelines];
    N <- length(matches)

    #from the ?regexpr manual       
    parse.one <- function(res, result) {
        m <- do.call(rbind, lapply(seq_along(res), function(i) {
            if(result[i] == -1) return("")
            st <- attr(result, "capture.start")[i, ]
            substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
        }))
        colnames(m) <- attr(result, "capture.names")
        m
    }

    #parse all records
    mydf <- data.frame(date=rep(NA, N), name=rep(NA, N), version=rep(NA,N))
    for(i in 1:N){
        mydf[i,] <- parse.one(datastring[i], matches[[i]]);
    }
    row.names(mydf) <- NULL;
    #convert dates
    mydf$date <- strptime(mydf$date, format="%b %d %H:%M");

    #The FTP listing only shows a time (and no year) for files less than six months old,
    #and those are the only lines the regex matches. strptime() assumes the current year,
    #so any date that ends up in the future must actually be from last year: subtract a year,
    #allowing a one-month margin for timezone differences.
    infuture <- (mydf$date > Sys.time() + 31*24*60*60);
    mydf$date[infuture] <- mydf$date[infuture] - 365*24*60*60;

    #sort and return
    mydf <- mydf[order(mydf$date),];
    row.names(mydf) <- NULL;
    return(mydf);
}

You could process the page http://cran.r-project.org/src/contrib/ and split the fields by whitespace to obtain the fully specified package source filename, which includes the version number and the .tar.gz suffix.

There are a few other items in the list that are not package files, such as the .rds files, various subdirectories, and so on.

Barring changes in how the directory structure is presented or the locations of the files, I can't think of anything more authoritative than this.
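A minimal sketch of that idea, using a regular expression instead of whitespace splitting to pull the *.tar.gz file names out of the directory index (the exact listing format depends on the server, so treat this as illustrative only):

contrib <- readLines(url("http://cran.r-project.org/src/contrib/"), warn = FALSE)

#keep only the lines that mention a package source tarball
tarballs <- regmatches(contrib, regexpr("[a-zA-Z0-9.]+_[0-9.-]+\\.tar\\.gz", contrib))

#split each file name into package name and version
name    <- sub("_.*$", "", tarballs)
version <- sub("\\.tar\\.gz$", "", sub("^[^_]*_", "", tarballs))
head(data.frame(name, version))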

Licensed under: CC-BY-SA with attribution