Question

The short version

How would I go about blocking access to a file until a specific function that both reads from and writes to that very file has returned?


The use case

I often want to create some sort of central registry and there might be more than one R process involved in reading from and writing to that registry (in kind of a "poor man's parallelization" setting where different processes run independently from each other except with respect to the registry access).

I would prefer not to depend on any DBMS such as SQLite, PostgreSQL, MongoDB etc. early on in the development process. And even though I might use a DBMS later, a filesystem-based solution might still be a handy fallback option. Thus I'm curious how I could realize it, ideally with base R functionality.

I'm aware that having a lot of reads and writes to the file system in a parallel setting is not very efficient compared to DBMS solutions.

I'm running on MS Windows 8.1 (64 Bit)

What I'd like to get a deeper understanding of

What exactly happens when two or more R processes try to write to or read from a file at the same time? Does the OS figure out the "access order" automatically, and does the process that "came in second" wait, or does it trigger an error because file access is blocked by the first process? How could I prevent the second process from returning with an error and have it "just wait" until it's its turn instead?

Shared workspace of processes

Besides the rredis Package: are there any other options for shared memory on MS Windows?

Illustration

Path to registry file:

path_registry <- file.path(tempdir(), "registry.rdata")

Example function that registers events:

registerEvent <- function(
    id=gsub("-| |:", "", Sys.time()), 
    values, 
    path_registry
) {
    if (!file.exists(path_registry)) {
        registry <- new.env()
        save(registry, file=path_registry)
    } else {
        load(path_registry)
    }

    message("Simulated additional runtime between reading and writing (5 seconds)")
    Sys.sleep(5)

    if (!exists(id, envir=registry, inherits=FALSE)) {
        assign(id, values, registry)
        save(registry, file=path_registry)
        message(sprintf("Registering with ID %s", id))
        out <- TRUE
    } else {
        message(sprintf("ID %s already registered", id))
        out <- FALSE
    }
    out
}

Example content that is registered:

x <- new.env()
x$a <- TRUE
x$b <- letters[1:5]

Note that the content usually is "nested", i.e. an RDBMS would not be really "useful" anyway, or at least would involve some normalization steps before writing to the DB. That's why I prefer environments (unique variable IDs, and pass-by-reference is possible) over lists and, if one does take the step to use a true DBMS, I would rather turn to NoSQL approaches such as MongoDB.

Registration cycle:

The actual calls might be spread over different processes, so there is a possibility of concurrent access attempts.

I want to have other processes/calls "wait" until a registerEvent read-write cycle is finished before doing their read-write cycle (without triggering errors).

registerEvent(values=list(x_1=x, x_2=x), path_registry=path_registry)
registerEvent(values=list(x_1=x, x_2=x), path_registry=path_registry)
registerEvent(id="abcd", values=list(x_1=x, x_2=x), 
    path_registry=path_registry)
registerEvent(id="abcd", values=list(x_1=x, x_2=x), 
    path_registry=path_registry)

Check registry content:

load(path_registry)
ls(registry)

Solution

See the filelock R package, available since 2018. It is cross-platform. I am using it on Windows and have not run into a single problem.

Make sure to read the documentation.

?filelock::lock

Although the docs suggest leaving the lock file in place, I have had no problems removing it on function exit in a multi-process environment:

on.exit({filelock::unlock(lock); file.remove(path.lock)})
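To tie this back to the question, here is a minimal sketch of how the read-write cycle could be wrapped in an exclusive lock, assuming the filelock package is installed. The function name register_event_locked and the ".lock" suffix for the lock file are hypothetical choices made for illustration, not something taken from the question or the package. It keeps the save/load approach from the question and leaves the lock file in place, as the docs suggest.

```r
# Sketch only: serializes the read-modify-write cycle across processes.
# `register_event_locked` and the ".lock" suffix are illustrative assumptions.
register_event_locked <- function(
    id = gsub("-| |:", "", Sys.time()),
    values,
    path_registry,
    path_lock = paste0(path_registry, ".lock")
) {
    # Wait (timeout = Inf) until this process holds the exclusive lock.
    lock <- filelock::lock(path_lock, exclusive = TRUE, timeout = Inf)
    # Release the lock however the function exits, including on error.
    on.exit(filelock::unlock(lock), add = TRUE)

    # Read the current registry, or start with an empty one.
    if (file.exists(path_registry)) {
        load(path_registry)
    } else {
        registry <- new.env()
    }

    # Refuse to overwrite an ID that is already registered.
    if (exists(id, envir = registry, inherits = FALSE)) {
        return(FALSE)
    }

    # Write the updated registry back while still holding the lock.
    assign(id, values, envir = registry)
    save(registry, file = path_registry)
    TRUE
}
```

Every process that touches the registry has to go through the same lock file for this to work; a process that reads or writes path_registry directly is not held back by the lock.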
Licensed under: CC-BY-SA with attribution