com.groupdocs.parser

Class WordsFormattedTextExtractor

  • All Implemented Interfaces:
    IHighlightExtractor, IPageTextExtractor, ITextExtractorWithFormatter, AutoCloseable


    public final class WordsFormattedTextExtractor
    extends TextExtractor
    implements IPageTextExtractor, IHighlightExtractor, ITextExtractorWithFormatter

    Provides a formatted text extractor for text documents.


    Supported formats:

    .DOC                  Microsoft Word Text document
    .DOT                  Microsoft Word Text template
    .DOCX                 Microsoft Office Open XML Text document
    .DOCM                 Microsoft Word 2007 macro-enabled document
    .RTF                  Rich Text Format text file
    .ODT                  OpenDocument text
    .HTML (.XHTML, .HTM)  Hypertext Markup Language document
    .MHTML (.MHT)         Web Archive Single File

    Extracting text from a document:

     // Create a formatted text extractor for text documents
     // (stream is assumed to be an open java.io.InputStream for the document)
     WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
     // Extract the formatted text
     System.out.println(extractor.extractAll());
      

    Extracting by pages:

     // Create a formatted text extractor for text documents
     WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
     // Iterate pages
     for (int pageIndex = 0; pageIndex < extractor.getPageCount(); pageIndex++) {
         // Extract the formatted text from the page at index pageIndex
         System.out.println(extractor.extractPage(pageIndex));
     }
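
The page loop above can be wrapped so that the extractor is always closed, since the class implements AutoCloseable. The following is a minimal sketch under stated assumptions, not the library's own code: PageSource, InMemoryPages and PageCollector are hypothetical stand-ins that only mirror the documented getPageCount() and extractPage(int) methods, so that the example compiles without GroupDocs.Parser on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the documented page methods plus close().
// It is NOT a GroupDocs.Parser type.
interface PageSource extends AutoCloseable {
    int getPageCount();

    String extractPage(int pageIndex);

    @Override
    void close(); // narrowed: no checked exception for callers to handle
}

// Hypothetical in-memory implementation used only to exercise the sketch.
final class InMemoryPages implements PageSource {
    private final String[] pages;
    private boolean closed;

    InMemoryPages(String... pages) {
        this.pages = pages;
    }

    @Override
    public int getPageCount() {
        return pages.length;
    }

    @Override
    public String extractPage(int pageIndex) {
        return pages[pageIndex];
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

final class PageCollector {
    private PageCollector() {
    }

    // Reads every page in index order and closes the source afterwards,
    // even if extraction throws.
    static List<String> collectPages(PageSource source) {
        List<String> result = new ArrayList<>();
        try (PageSource s = source) {
            int count = s.getPageCount();
            for (int pageIndex = 0; pageIndex < count; pageIndex++) {
                String text = s.extractPage(pageIndex);
                // extractPage is documented as possibly returning null
                result.add(text == null ? "" : text);
            }
        }
        return result;
    }
}
```

With the real class, the same try-with-resources shape would apply directly to a WordsFormattedTextExtractor instance; mapping null to an empty string is a choice made for this sketch only.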
      

    To set a formatter, use the DocumentFormatter property:

     // Create a formatted text extractor for text documents
     WordsFormattedTextExtractor extractor = new WordsFormattedTextExtractor(stream);
     // Set a markdown formatter for formatting
     extractor.setDocumentFormatter(new MarkdownDocumentFormatter()); // all the text will be formatted as Markdown
      

    By default, text is formatted as plain text by PlainDocumentFormatter.

    • Method Detail

      • getDocumentFormatter

        public DocumentFormatter getDocumentFormatter()

        Gets a DocumentFormatter.

        Specified by:
        getDocumentFormatter in interface ITextExtractorWithFormatter
        Returns:
        An instance of the DocumentFormatter. The default is PlainDocumentFormatter.


        By default, the value is an instance of the PlainDocumentFormatter class. You can set any other formatter, or null to use the default formatter.

      • setDocumentFormatter

        public void setDocumentFormatter(DocumentFormatter value)

        Sets a DocumentFormatter.

        Specified by:
        setDocumentFormatter in interface ITextExtractorWithFormatter
        Parameters:
        value - An instance of the DocumentFormatter. The default is PlainDocumentFormatter.


        By default, the value is an instance of the PlainDocumentFormatter class. You can set any other formatter, or null to use the default formatter.

      • getPageCount

        public int getPageCount()

        Gets the total number of pages.

        Specified by:
        getPageCount in interface IPageTextExtractor
        Returns:
        The total number of pages.
      • extractPage

        public String extractPage(int pageIndex)

        Extracts all characters from the page at the given pageIndex and returns the data as a string.

        Specified by:
        extractPage in interface IPageTextExtractor
        Parameters:
        pageIndex - The zero-based index of the page.
        Returns:
        A string that contains all characters from the page, or null if all characters have been extracted.
      • reset

        public void reset()

        Resets the current document.


        Resets the cursor position. After this call, the extractLine method returns the first line of the document.

        Overrides:
        reset in class TextExtractor
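
The cursor contract described for reset can be illustrated with a small model. This is only a sketch of the documented behavior (read lines until null, then rewind), assuming a hypothetical in-memory LineCursor class; it does not show how TextExtractor is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory reader that mimics the documented cursor contract.
// It is NOT a GroupDocs.Parser type.
final class LineCursor {
    private final List<String> lines;
    private int position;

    LineCursor(String... lines) {
        this.lines = Arrays.asList(lines);
    }

    // Returns the next line, or null once every line has been read.
    String extractLine() {
        return position < lines.size() ? lines.get(position++) : null;
    }

    // Moves the cursor back to the first line.
    void reset() {
        position = 0;
    }

    // Reads from the current position until null and counts the lines seen.
    static int countRemaining(LineCursor cursor) {
        int count = 0;
        while (cursor.extractLine() != null) {
            count++;
        }
        return count;
    }
}
```

After countRemaining has consumed the cursor, extractLine keeps returning null until reset is called, at which point the first line is returned again.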
      • extractHighlights

        public List<String> extractHighlights(HighlightOptions... highlightOptions)

        Extracts highlights.

        Specified by:
        extractHighlights in interface IHighlightExtractor
        Parameters:
        highlightOptions - A collection of HighlightOptions.


        Only extraction with Mode = FixedWidth is supported.

        Returns:
        A collection of strings that represent highlights. If no highlights are found, the collection is empty.
        Throws:
        UnsupportedOperationException - Mode is not FixedWidth.
      • extractText

        protected String extractText()

        Extracts all characters from the current position to the end of the text extractor and returns them as one string.

        Overrides:
        extractText in class TextExtractor
        Returns:
        A string that contains all characters from the current position to the end of the text extractor.
      • extractTextLine

        protected String extractTextLine()

        Extracts a line of characters from the text extractor and returns the data as a string.

        Overrides:
        extractTextLine in class TextExtractor
        Returns:
        The next line from the extractor, or null if all characters have been extracted.
      • prepareLine

        protected String prepareLine()

        Returns a line of the text.

        Specified by:
        prepareLine in class TextExtractor
        Returns:
        A string that represents a line of the text, or null if all characters have been read.