Get all documents with GridFSOperations

https://stackoverflow.com/questions/22396822

14-06-2023

Question

I've decided to move one of our projects from PostgreSQL to MongoDB, and this project deals with images. I can now save images and retrieve them by their _id, but I couldn't find a GridFSOperations method that safely returns all documents. I need this so that I can take the photo metadata I saved with each image and index it with Lucene (I need full-text search on some of the relevant metadata, and there may be future scenarios where we have to rebuild the Lucene index).

In the old code, I simply had a function with an offset and limit for the SQL query as I found out (the hard way) that our dev system can only do bulk Lucene adds in groups of 5k. Is there an equivalent way of doing this with GridFS?

Edit:

function inherited from the old interface:

public List<Photo> getPublicPhotosForReindexing(long offset, long limit) {
    List<Photo> result = new ArrayList<>();
    List<GridFSDBFile> files = gridFsOperations.find(new Query().limit((int) limit).skip((int) offset));
    for (GridFSDBFile file : files) {
        result.add(convertToPhoto(file));
    }
    return result;
}

a simple converter taking parts of the metadata and putting it in the POJO I made:

private Photo convertToPhoto(GridFSDBFile fsFile) {
    Photo resultPhoto = new Photo(fsFile.getId().toString());
    try {
        resultPhoto
                .setOriginalFilename(fsFile.getFilename())
                // .setPhotoData(IOUtils.toByteArray(fsFile.getInputStream()))
                .setDateAdded(fsFile.getUploadDate());
    } catch (Exception e) {
        logger.error("Should not hit this one", e);
    }
    return resultPhoto;
}

Solution

When you use GridFS, the information is stored in your MongoDB database in two collections: fs.files, which holds the main reference to each file, and fs.chunks, which holds the actual "chunks" of data. See the examples:

Collection: fs.files

{
    "_id" : ObjectId("53229d20f3dde871df8b89a7"),
    "filename" : "receptor.jpg",
    "chunkSize" : 262144,
    "uploadDate" : ISODate("2014-03-14T06:09:36.462Z"),
    "md5" : "f1e71af6d0ba9c517280f33b4cbab3f9",
    "length" : 138905
}

Collection: fs.chunks

{
    "_id" : ObjectId("53229d20824b12efe88cc1f2"),
    "files_id" : ObjectId("53229d20f3dde871df8b89a7"),
    "n" : 0,
    "data" : // all of the binary data

}
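
To make the relationship between the two collections concrete: a file's content is split into pieces of at most chunkSize bytes, and each piece is stored as one fs.chunks document numbered by n. The following is a minimal sketch of that arithmetic using the length and chunkSize values from the example above. The class and method names are hypothetical and nothing here talks to a database.

```java
// Hypothetical helper: how many fs.chunks documents a file of a given size needs.
class ChunkCountSketch {

    /** Ceiling division of length by chunkSize, without floating point. */
    static long chunkCount(long length, long chunkSize) {
        if (length < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and chunkSize > 0");
        }
        return (length + chunkSize - 1) / chunkSize;
    }

    public static void main(String[] args) {
        // Values taken from the fs.files example document.
        System.out.println(chunkCount(138905, 262144)); // prints 1
        // A larger, made-up file size for comparison.
        System.out.println(chunkCount(600000, 262144)); // prints 3
    }
}
```

The example file fits in a single chunk, which is why the only fs.chunks document shown has n equal to 0.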

So really these are just normal MongoDB documents and normal collections.

As you can see, there are several fields you can use to "query" these collections with the standard API:

The ObjectId in _id starts with a creation timestamp, so it is roughly ever-increasing: newer entries will generally have a higher ObjectId value than older ones. Most importantly, this means you can keep track of the last _id that was indexed and ask only for entries greater than it.
The uploadDate field also holds a general timestamp that you can use for date-range queries.
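
The timestamp embedded in the ObjectId can be checked by hand. The sketch below assumes the documented ObjectId layout, in which the first 4 bytes are seconds since the Unix epoch, and decodes the _id from the fs.files example above. The class and method names are made up for illustration.

```java
import java.time.Instant;

// Hypothetical helper: reads the creation time out of an ObjectId's hex form.
class ObjectIdTimestampSketch {

    /** Interprets the first 8 hex characters (4 bytes) as epoch seconds. */
    static Instant creationTime(String objectIdHex) {
        if (objectIdHex == null || objectIdHex.length() != 24) {
            throw new IllegalArgumentException("expected a 24-character hex ObjectId");
        }
        long epochSeconds = Long.parseLong(objectIdHex.substring(0, 8), 16);
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        // _id of the fs.files example document.
        System.out.println(creationTime("53229d20f3dde871df8b89a7"));
        // prints 2014-03-14T06:09:36Z
    }
}
```

The result agrees with the uploadDate of the same document to the second, which is why a range on _id can stand in for a range on creation time.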

So you see that GridFS is really just "driver-level magic" for working with ordinary MongoDB documents and treating the chunked binary data as a single document.

Since these are just normal collections with normal documents, unless you are retrieving or otherwise updating the file content itself, just use the normal methods to select and find.
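
For the batched re-indexing in the question, one way to apply this is to page by _id range instead of by offset: remember the last _id of each batch and ask for the next N documents greater than it, sorted by _id. The sketch below is only an in-memory illustration of that loop. A TreeMap stands in for fs.files, ids are the 24-character hex form of the ObjectId (which sorts the same way as the underlying bytes), and all class and method names are hypothetical. Against a real database the equivalent would be a find on fs.files with a $gt condition on _id, a sort on _id, and a limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for paging through fs.files by _id. Not driver code.
class KeysetBatchSketch {

    /** Returns up to `limit` ids strictly greater than lastSeenId (null = start). */
    static List<String> nextBatch(NavigableMap<String, String> files,
                                  String lastSeenId, int limit) {
        NavigableMap<String, String> rest =
                (lastSeenId == null) ? files : files.tailMap(lastSeenId, false);
        List<String> batch = new ArrayList<>();
        for (String id : rest.keySet()) {
            if (batch.size() == limit) {
                break;
            }
            batch.add(id);
        }
        return batch;
    }

    /** Walks every entry in batches and returns the ids in visit order. */
    static List<String> reindexAll(NavigableMap<String, String> files, int batchSize) {
        List<String> visited = new ArrayList<>();
        String lastSeenId = null;
        while (true) {
            List<String> batch = nextBatch(files, lastSeenId, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            visited.addAll(batch); // a real job would hand this batch to Lucene
            lastSeenId = batch.get(batch.size() - 1);
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> files = new TreeMap<>();
        for (int i = 0; i < 12; i++) {
            // Made-up ids sharing the prefix of the example document.
            files.put(String.format("53229d20f3dde871df8b%04x", i), "photo" + i + ".jpg");
        }
        System.out.println(reindexAll(files, 5).size()); // prints 12
    }
}
```

Unlike skip, this approach does not have to walk past the documents already visited, and the last _id can be stored so that a later run picks up only the files added since.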

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow