How to extract bullet information from word document?

https://stackoverflow.com/questions/2302077

21-09-2019
|

Question

I want to extract information of bullets present in word document. I want something like this : Suppose the text below, is in word document :

Steps to Start car :

Open door
Sit inside
Close the door
Insert key
etc.

Then I want my text file like below :

Steps to Start car :

<BULET> Sit inside </BULET>

<BULET> Close the door </BULET>

<BULET> Insert key </BULET>

I am using C# language to do this.

I can extract paragraphs from word document and directly write them in text file with some formatting information like whether text is bold or is in italics, etc. but dont know how to extract this bullet information.

Can anyone please tell me how to do this?

Thanks in advance

Solution 3

I got the answer.....

First I was converting doc on paragraph basis. But instead of that if we process doc file sentence by sentence basis, it is possible to determine whether that sentence contains bullet or any kind of shape or if that sentence is part of table. So once we get this information, then we can convert that sentence appropriately. If someone needs source code, I can share it.

OTHER TIPS

You can do it by reading each sentence. doc.Sentences is an array of Range object. So you can get same Range object from Paragraph.

        foreach (Paragraph para in oDoc.Paragraphs)
        {
            string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
            string bulletStr = para.Range.ListFormat.ListString;
            MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
        }

Into paraNumber you can get paragraph level and into buttetStr you can get bullet as string.

I am using this OpenXMLPower tool by Eric White. Its free and available at NUGet package. you can install it from Visual studio package manager.

He has provided a ready to use code snippet. This tool has saved me many hours. Below is the way I have customized code snippet to use for my requirement. Infact you can use these methods as it in your project.

 private static WordprocessingDocument _wordDocument;
 private StringBuilder textItemSB = new StringBuilder();
 private List<string> textItemList = new List<string>();


/// Open word document using office SDK and reads all contents from body of document
/// </summary>
/// <param name="filepath">path of file to be processed</param>
/// <returns>List of paragraphs with their text contents</returns>
private void GetDocumentBodyContents()
{
    string modifiedString = string.Empty;
    List<string> allList = new List<string>();
    List<string> allListText = new List<string>();

    try
    {
_wordDocument = WordprocessingDocument.Open(wordFileStream, false);
        //RevisionAccepter.AcceptRevisions(_wordDocument);
        XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
        XElement body = root.LogicalChildrenContent().First();
        OutputBlockLevelContent(_wordDocument, body);
    }
    catch (Exception ex)
    {
        logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
    }
}


// This is recursive method. At each iteration it tries to fetch listitem and Text item. Once you have these items in hand 
// You can manipulate and create your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
    try
    {
        string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
                itemText = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                if (itemText.Trim().Length > 0)
                {
                    if (null == listItem)
                    {
                        // Add html break tag 
                        textItemSB.Append( itemText + "<br/>");
                    }
                    else
                    {
                        //if listItem == "" bullet character, replace it with equivalent html encoded character                                   
                        textItemSB.Append("&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + (listItem == "" ? "&bull;" : listItem) + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + itemText + "<br/>");
                    }
                }
                else if (null != listItem)
                {
                    //If bullet character is found, replace it with equivalent html encoded character  
                    textItemSB.Append(listItem == "" ? "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&bull;" : listItem);
                }
                else
                    textItemSB.Append("<blank>");
                continue;
            }
            // If element is not a paragraph, it must be a table.

            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {                        
                foreach (var cell in row.LogicalChildrenContent())
                {                            
                    // Cells are a block-level content container, so can call this method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                }
            }
        }
        if (textItemSB.Length > 0)
        {
            textItemList.Add(textItemSB.ToString());
            textItemSB.Clear();
        }
    }
    catch (Exception ex)
    {
        .....
    }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow