I'm writing a basic text editor; well, really it's an edit control where I want to write code, numerical values, and expressions for my main program.

The way I'm currently doing it is that I feed character strings into the edit control. In the edit control I have a class that breaks the string up into “glyphs” like words, numbers, line breaks, tabs, format tokens, etc. The word glyphs for example contain a string representing a literal word and a short integer that represents the number of trailing white spaces. The glyphs also contain info needed when drawing the text and calculating line wrapping.

For example the text line “My name is Karl” would equal a linked list of glyphs like this: NewLineGlyph → WordGlyph (“My”, 1 whitespace) → WordGlyph (“name”, 1 whitespace) → WordGlyph(“is”, 1 whitespace ) → WordGlyph (“Karl”, 0 whitespace) → NULL.
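The glyph list described above could be sketched like this (a minimal illustration; the type and field names are my own invention, not from the original post):

```cpp
#include <memory>
#include <string>

// One node per "glyph"; words carry their literal text plus a count
// of trailing whitespace characters, as described above.
enum class GlyphKind { NewLine, Word };

struct Glyph {
    GlyphKind kind = GlyphKind::Word;
    std::string word;            // literal text (Word glyphs only)
    short trailingSpaces = 0;    // whitespace following the word
    std::unique_ptr<Glyph> next; // singly linked list
};

// Builds the list for the line "My name is Karl".
std::unique_ptr<Glyph> makeExample() {
    auto head = std::make_unique<Glyph>();
    head->kind = GlyphKind::NewLine;
    Glyph* tail = head.get();
    const char* words[] = {"My", "name", "is", "Karl"};
    for (int i = 0; i < 4; ++i) {
        auto g = std::make_unique<Glyph>();
        g->word = words[i];
        g->trailingSpaces = (i < 3) ? 1 : 0;  // last word has none
        tail->next = std::move(g);
        tail = tail->next.get();
    }
    return head;
}
```

Note that every word becomes its own heap allocation (the node plus, potentially, the string's buffer), which is what prompts the fragmentation question below.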

So instead of storing the string in memory as a continuous block of chars (or WCHARs), it is stored in small chunks with potentially lots of small allocations and deallocations.

My question is: should I be concerned with heap fragmentation when doing it this way? Do you have any tips on making this more efficient? Or a completely different way of doing it? :)

PS. I'm working in C++ on Win7.

Solution

Should you be concerned about fragmentation? The answer likely depends on how large your documents are (e.g., number of words), how much editing will occur, and the nature of those edits. The approach you have outlined might be reasonable for a static (read-only) document that you can "parse" once, but I imagine a fair amount of work will need to happen behind the scenes to keep your data structures in the correct state as a user makes arbitrary edits. Also, you'll have to decide what a "word" is, which isn't necessarily obvious or consistent in every case. For example, is "hard-working" one word or two? If it's one, does that mean you will never word wrap at the hyphen? Or consider the case where a "word" will not fit on a single line: will you simply truncate it, or would you want to force-break the word across lines?

My recommendation would be to store the text as a block, and store the line breaks separately (as offsets into the text block), then recalculate the line breaks as needed each time there is a change. If you're concerned about fragmentation and minimizing the number of allocations/deallocations, you could allocate fixed-size blocks and then manage memory inside of those blocks yourself. Here's what I've done in the past:
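To illustrate the "line breaks as offsets" idea, here is a minimal sketch (function and parameter names are mine). It uses a fixed column width as the wrap rule; a real editor would measure glyph widths instead, but the recompute-on-change structure is the same:

```cpp
#include <string>
#include <vector>

// Returns the offset at which each new line starts, given a hard
// break on '\n' and a soft wrap at the last space before maxCols.
std::vector<size_t> computeLineBreaks(const std::string& text, size_t maxCols) {
    std::vector<size_t> breaks;
    size_t lineStart = 0;
    size_t lastSpace = std::string::npos;
    for (size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\n') {                 // hard break
            breaks.push_back(i + 1);
            lineStart = i + 1;
            lastSpace = std::string::npos;
            continue;
        }
        if (text[i] == ' ') lastSpace = i;
        if (i - lineStart >= maxCols) {        // soft wrap
            size_t brk = (lastSpace != std::string::npos) ? lastSpace + 1 : i;
            breaks.push_back(brk);
            lineStart = brk;
            lastSpace = std::string::npos;
        }
    }
    return breaks;
}
```

Because the breaks are derived data, an edit only needs to invalidate and recompute them (ideally just from the edited line onward), rather than restructure the text storage itself.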

  • Text is stored as a block of characters, but rather than having a single contiguous block for the entire document, I maintain a linked list of blocks that are always allocated 4KB (i.e., either 4K single-byte chars, or 2K WCHARs). In other words, the text is stored as a linked list of arrays, where each array is allocated to a constant size.

  • Each block keeps track of how much space (i.e., how many characters) is used/free within that block.

  • When inserting one or more characters, if there is space in the current block, I can simply shift memory within that block (no allocation/deallocation required). If no space is available in the current block, but space is available in the adjacent block, then again I can just shift memory between existing blocks (no allocation/deallocation required). If both blocks are full, only then do I allocate a new 4KB block and add at the appropriate position in the linked list.

  • When deleting one or more characters, I simply need to shift memory (at most 4KB) rather than the entire document text. I may also have to deallocate and remove any block(s) that become completely empty.

  • I also do some "garbage collection" to coalesce free space at appropriate times. This is fairly straightforward and involves moving characters from one block to another so that some blocks become empty and can be removed.
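The insertion path from the steps above can be sketched as follows. This is a simplified illustration under my own naming, and it omits deletion and the coalescing/garbage-collection pass; it only shows the core trick that inserts usually just shift memory inside an existing fixed-size block:

```cpp
#include <algorithm>
#include <cstring>
#include <list>
#include <string>
#include <utility>

constexpr size_t kBlockSize = 4096;  // every allocation is this size

struct Block {
    char data[kBlockSize];
    size_t used = 0;                 // characters occupied in this block
};

class BlockText {
public:
    // Insert 's' at absolute character offset 'pos'.
    void insert(size_t pos, const std::string& s) {
        auto [it, off] = locate(pos);
        // Split so the insertion point is at the end of *it.
        if (off < it->used) {
            auto tail = blocks_.emplace(std::next(it));
            tail->used = it->used - off;
            std::memcpy(tail->data, it->data + off, tail->used);
            it->used = off;
        }
        // Append, spilling into freshly allocated blocks only on overflow.
        size_t i = 0;
        while (i < s.size()) {
            if (it->used == kBlockSize)
                it = blocks_.emplace(std::next(it));
            size_t n = std::min(s.size() - i, kBlockSize - it->used);
            std::memcpy(it->data + it->used, s.data() + i, n);
            it->used += n;
            i += n;
        }
        size_ += s.size();
    }

    std::string str() const {        // flatten, for display or testing
        std::string out;
        for (const Block& b : blocks_) out.append(b.data, b.used);
        return out;
    }

    size_t size() const { return size_; }

private:
    // Find the block containing 'pos' and the offset within it.
    std::pair<std::list<Block>::iterator, size_t> locate(size_t pos) {
        auto it = blocks_.begin();
        while (it != blocks_.end() && pos > it->used) {
            pos -= it->used;
            ++it;
        }
        if (it == blocks_.end()) it = blocks_.emplace(it);  // empty document
        return {it, pos};
    }

    std::list<Block> blocks_;
    size_t size_ = 0;
};
```

In the common case the insertion lands in a block with free space and only a memmove-sized shift happens; a new 4KB node is allocated only when a block genuinely overflows, matching the scheme described above.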

From the perspective of the OS and/or runtime library, all of the allocations/deallocations are the same size (4KB), so there is no fragmentation. And since I manage the contents of that memory, I can avoid fragmentation within my allocated space by shifting memory contents to eliminate wasted space. The other advantage is that it minimizes the number of alloc/dealloc calls, which can be a performance concern depending on what allocator you are using. So, it's an optimization for both speed and size -- how often does that happen? :-)

Other tips

I wouldn't worry about heap fragmentation; modern heap managers are pretty good at dealing with that.

I might worry about poor data locality, though. With each glyph as a separate allocation in a linked list (especially a non-invasive list like std::list), any sort of pass through the document is going to jump all over memory in a potentially non-cache-friendly way.

Text editors are harder than they seem at first blush. There are a lot of specialized data structures out there for representing blocks of text and structured documents. They each optimize for different types of operations. I recommend searching for explanations of them and then considering the types of operations you'll have to do most.

This paper is old, but it has a lot of good information: http://www.cs.unm.edu/~crowley/papers/sds.pdf

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow