Question

I have a large text in a database that has three main components: the entire text, the paragraphs that make up the main text, and the words (tokens) in each paragraph.

Each of these three components will have its own linked related content. For example, each paragraph will have a list of other texts that discuss the same subject, gathered from many sources (scholarly works, contemporary thoughts, etc.). I want to design a Play Framework model that captures this association between each of the three components and its related content categories (scholarly works, contemporary thoughts, etc.).

How can I design clean Play Framework 1.x models that reflect this linking at a granular level, from the main text to its paragraphs to the tokens of each paragraph, together with the association with the related content categories? I'm sure there is a good design pattern for this scenario. Can someone suggest a clean solution?

Solution

I suggest you store the text once in its entirety and then have a flexible class hierarchy to index the contents.

I've included Hibernate annotations only where the mapping needs something beyond the defaults.

You create a container to hold the text itself and the part objects:

@Entity
public class DocumentContainer extends Model {
    // Column definition depends on the DB, here: MySQL
    @Column(columnDefinition="LONGTEXT")
    public String text;

    public Set<DocumentPart> documentParts;
}

A part of a document is defined on a region of the text, is of a certain type, and can reference other parts of documents:

@Entity
@Inheritance(strategy=InheritanceType.JOINED)
@DiscriminatorColumn(name="partType")
public class DocumentPart extends Model {

    // the container that holds the full text this part refers to
    DocumentContainer document;

    // indices into the document's text that delimit this part
    int startIndex;
    int endIndex;

    @Enumerated(EnumType.STRING)
    PartType partType;

    Set<DocumentPart> referencedParts;
}

public enum PartType {
    DOCUMENT, PARAGRAPH, TOKEN
}
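To make the index-based idea concrete, here is a minimal plain-Java sketch with no JPA or Play dependencies. `TextRegion`, `RegionDemo`, `textOf`, and `tokenize` are hypothetical names introduced for illustration, and treating `endIndex` as exclusive is an assumption, since the answer does not say. It shows how a part can recover its text from the single stored copy, and how token regions could be derived from a paragraph region.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-Java stand-in for DocumentPart: a typed region over the full text.
class TextRegion {
    final String kind;     // "PARAGRAPH" or "TOKEN", mirroring PartType
    final int startIndex;  // inclusive offset into the full text
    final int endIndex;    // exclusive offset (an assumption; the answer does not specify)

    TextRegion(String kind, int startIndex, int endIndex) {
        this.kind = kind;
        this.startIndex = startIndex;
        this.endIndex = endIndex;
    }

    // Resolves the region against the single stored copy of the text.
    String textOf(String fullText) {
        return fullText.substring(startIndex, endIndex);
    }
}

class RegionDemo {
    // Derives TOKEN regions from a PARAGRAPH region by splitting on whitespace.
    static List<TextRegion> tokenize(String fullText, TextRegion paragraph) {
        List<TextRegion> tokens = new ArrayList<>();
        int i = paragraph.startIndex;
        while (i < paragraph.endIndex) {
            // Skip whitespace between tokens.
            while (i < paragraph.endIndex && Character.isWhitespace(fullText.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < paragraph.endIndex && !Character.isWhitespace(fullText.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new TextRegion("TOKEN", start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "First paragraph here.\n\nSecond one.";
        TextRegion paragraph = new TextRegion("PARAGRAPH", 0, 21);
        for (TextRegion token : tokenize(text, paragraph)) {
            System.out.println(token.startIndex + "-" + token.endIndex + ": " + token.textOf(text));
        }
    }
}
```

Because the regions only store offsets, punctuation and whitespace stay in the original text and no content is duplicated per part.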

A paragraph would then be, for example:

@Entity
@DiscriminatorValue("PARAGRAPH")
public class Paragraph extends DocumentPart {
     Set<Token> tokens;
}

In this way, you stay flexible about which types of regions you define over your document, and you preserve the whole document (including punctuation, etc.).

OTHER TIPS

From what you have written, you could go with...

@Entity
public class Document extends Model {
    public List<Paragraph> paragraphs;
}

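@Entity
public class Paragraph extends Model {
    public List<String> words;       // element type is an assumption
    public List<Citation> citations; // element type assumed to be the Citation class below
}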

@Entity
public class Citation extends Model {
    public String type;
    public URL linkedResource; // is resource external?
    public List<Document> linkedDocuments; // is resource internal to this system? (field name is illustrative)
}

You weren't clear on the linkage of the citation, so I have given 2 options. You could go with either, or both.
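As a rough illustration of how the `type` field could serve the "related content categories" from the question, the following plain-Java sketch groups a paragraph's citations by type. It uses no Play or JPA classes. `CitationSketch`, `CitationGrouping`, `byType`, the string document IDs, and the example URLs are all hypothetical stand-ins and not part of the answer above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Citation: either an external link or internal document references.
class CitationSketch {
    final String type;                      // category, e.g. "scholarly" or "contemporary"
    final URI externalResource;             // null when the citation is internal
    final List<String> internalDocumentIds; // empty when the citation is external

    private CitationSketch(String type, URI externalResource, List<String> internalDocumentIds) {
        this.type = type;
        this.externalResource = externalResource;
        this.internalDocumentIds = internalDocumentIds;
    }

    static CitationSketch external(String type, String url) {
        return new CitationSketch(type, URI.create(url), new ArrayList<String>());
    }

    static CitationSketch internal(String type, List<String> documentIds) {
        return new CitationSketch(type, null, new ArrayList<String>(documentIds));
    }

    boolean isExternal() {
        return externalResource != null;
    }
}

class CitationGrouping {
    // Groups citations by their type, keeping first-seen order of the categories.
    static Map<String, List<CitationSketch>> byType(List<CitationSketch> citations) {
        Map<String, List<CitationSketch>> grouped = new LinkedHashMap<>();
        for (CitationSketch c : citations) {
            grouped.computeIfAbsent(c.type, k -> new ArrayList<>()).add(c);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<CitationSketch> citations = new ArrayList<>();
        citations.add(CitationSketch.external("scholarly", "https://example.org/paper"));
        citations.add(CitationSketch.internal("contemporary", Arrays.asList("doc-42")));
        for (Map.Entry<String, List<CitationSketch>> e : byType(citations).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " citation(s)");
        }
    }
}
```

If the set of categories is fixed, an enum in place of the free-form `String type` would be a reasonable alternative; that choice is left open here.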

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow