com.groupdocs.parser

Class WordsTextExtractor

  • All Implemented Interfaces:
    IHighlightExtractor, IRegexSearchable, ISearchable, IStructuredExtractor, AutoCloseable


    public final class WordsTextExtractor
    extends TextExtractor
    implements ISearchable, IHighlightExtractor, IRegexSearchable, IStructuredExtractor

    Provides a text extractor for text documents.


    Supported formats:

    .DOC                    Microsoft Word Text document
    .DOT                    Microsoft Word Text template
    .DOCX                   Microsoft Office Open XML Text document
    .DOCM                   Microsoft Word 2007 Master document
    .RTF                    Rich Text Format text file
    .ODT                    OpenDocument text
    .TXT                    Plain text
    .HTML (.XHTML, .HTM)    Hypertext Markup Language document
    .MHTML (.MHT)           Web Archive Single File

    Extracting text from a text document:

     // Create a text extractor for text documents
     WordsTextExtractor extractor = new WordsTextExtractor(stream);
     // Extract all text from the document and print it
     System.out.println(extractor.extractAll());
      
    • Method Detail

      • search

        public void search(SearchOptions options,
                  ISearchHandler handler,
                  List<String> keywords)

        Searches the document for the given keywords.

        Specified by:
        search in interface ISearchable
        Parameters:
        options - Options for searching.
        handler - An instance of the search handler.
        keywords - A collection of words to search for.
      • search

        public void search(SearchOptions options,
                  ISearchHandler handler,
                  ISearchEngine searchEngine,
                  List<String> keywords)

        Searches the document for the given keywords using the specified search engine.

        Specified by:
        search in interface ISearchable
        Parameters:
        options - Options for searching.
        handler - An instance of the search handler.
        searchEngine - An instance of the search engine.
        keywords - A collection of words to search for.
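
Both search overloads report results through a handler rather than a return value. The sketch below illustrates that callback style in plain Java only. HitHandler, KeywordSearchSketch and the "keyword@index" format are hypothetical names made up for this example; the actual methods of ISearchHandler, ISearchEngine and SearchOptions are not described in this section and are not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical callback; NOT the real ISearchHandler interface. */
interface HitHandler {
    void onHit(String keyword, int index);
}

/** Hypothetical stand-in that searches a plain string instead of a document. */
final class KeywordSearchSketch {

    /** Reports every case-sensitive occurrence of each keyword to the handler. */
    static void search(String text, List<String> keywords, HitHandler handler) {
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // an empty keyword would match at every position
            }
            int from = 0;
            int index;
            while ((index = text.indexOf(keyword, from)) >= 0) {
                handler.onHit(keyword, index);
                from = index + keyword.length(); // continue after this hit
            }
        }
    }

    /** Convenience wrapper that collects hits as "keyword@index" strings. */
    static List<String> collectHits(String text, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        search(text, keywords, (keyword, index) -> hits.add(keyword + "@" + index));
        return hits;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("text", "extract");
        System.out.println(collectHits("text extractors extract text.", keywords));
    }
}
```

Passing the handler in lets the caller decide what to do with each hit (collect it, count it, stop early) without the search routine building a result list of its own.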
      • extractHighlights

        public List<String> extractHighlights(HighlightOptions... highlightOptions)

        Extracts highlights.

        Specified by:
        extractHighlights in interface IHighlightExtractor
        Parameters:
        highlightOptions - A collection of HighlightOptions.
        Returns:
        A collection of strings that represent highlights. If no highlights are found, the collection is empty.
      • reset

        public void reset()

        Resets the current document.


        Resets the cursor's position, so the next call to extractLine returns the first line of the document.

        Overrides:
        reset in class TextExtractor
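
The cursor behaviour described for reset can be pictured with a small stand-alone model. LineCursorSketch is a hypothetical class written for this illustration, not part of the library; it only assumes what the text above states, namely that lines are read one at a time and that reset moves the cursor back to the first line.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a line-based extractor; NOT the GroupDocs class. */
final class LineCursorSketch {
    private final List<String> lines;
    private int position = 0; // index of the next line to return

    LineCursorSketch(List<String> lines) {
        this.lines = lines;
    }

    /** Returns the next line, or null once every line has been read. */
    String extractLine() {
        return position < lines.size() ? lines.get(position++) : null;
    }

    /** Moves the cursor back to the start of the document. */
    void reset() {
        position = 0;
    }

    public static void main(String[] args) {
        LineCursorSketch cursor = new LineCursorSketch(Arrays.asList("first", "second"));
        cursor.extractLine(); // consumes "first"
        cursor.extractLine(); // consumes "second"
        cursor.reset();       // cursor is back at the start
        System.out.println(cursor.extractLine());
    }
}
```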
      • dispose

        protected void dispose(boolean disposing)

        Releases the unmanaged resources used by the extractor.

        Overrides:
        dispose in class TextExtractor
        Parameters:
        disposing - true if invoked from Dispose; otherwise, false.
      • extractText

        protected String extractText()

        Extracts all characters from the current position to the end of the text extractor and returns them as one string.

        Overrides:
        extractText in class TextExtractor
        Returns:
        A string that contains all characters from the current position to the end of the text extractor.
      • prepareLine

        protected String prepareLine()

        Returns a line of the text.

        Specified by:
        prepareLine in class TextExtractor
        Returns:
        A string that represents a line of the text, or null if all characters have been read.
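
The "null when all characters have been read" contract is the same one java.io.BufferedReader.readLine uses, so a consuming loop has the same shape. The sketch below uses only the standard library to show that shape; ReadUntilNullSketch and readAllLines are names invented for the example, and nothing here calls the extractor itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class ReadUntilNullSketch {

    /** Reads lines until the source returns null, which marks the end of the text. */
    static List<String> readAllLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader source = new BufferedReader(new StringReader(text))) {
            String line;
            // Test against null, not isEmpty(): an empty line is still a line.
            while ((line = source.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readAllLines("alpha\n\nbeta"));
    }
}
```

The loop stops only on null, so a blank line in the middle of the text is kept as an empty string rather than being mistaken for the end.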