com.groupdocs.parser

Class PdfTextExtractor

    • Constructor Detail

      • PdfTextExtractor

        public PdfTextExtractor(String fileName)

        Initializes a new instance of the PdfTextExtractor class.

        Parameters:
        fileName - The path to the file.
      • PdfTextExtractor

        public PdfTextExtractor(String fileName,
                        LoadOptions loadOptions)

        Initializes a new instance of the PdfTextExtractor class.

        Parameters:
        fileName - The path to the file.
        loadOptions - The options for loading the file.
      • PdfTextExtractor

        public PdfTextExtractor(InputStream stream)

        Initializes a new instance of the PdfTextExtractor class.

        Parameters:
        stream - The stream of the document.
      • PdfTextExtractor

        public PdfTextExtractor(InputStream stream,
                        LoadOptions loadOptions)

        Initializes a new instance of the PdfTextExtractor class.

        Parameters:
        stream - The stream of the document.
        loadOptions - The options for loading the document.
    • Method Detail

      • getDocumentContent

        public DocumentContent getDocumentContent()

        Gets access to the document's content.

        Returns:
        An instance of the DocumentContent class.
      • getExtractMode

        public int getExtractMode()

        Gets a value indicating the mode of text extraction.

        Returns:
        The mode of text extraction. The default is Standard.
      • setExtractMode

        public void setExtractMode(int value)

        Sets a value indicating the mode of text extraction.

        Parameters:
        value - The mode of text extraction. The default is Standard.
      • getPageCount

        public int getPageCount()

        Gets the total number of pages.

        Specified by:
        getPageCount in interface IPageTextExtractor
        Returns:
        The total number of pages.
      • openEntityStream

        public InputStream openEntityStream(Container.Entity entity)

        Opens a stream with the content of the container's entity.

        Specified by:
        openEntityStream in interface IContainer
        Parameters:
        entity - A container's entity.
        Returns:
        A stream with the content of the container's entity.
      • search

        public void search(SearchOptions options,
                  ISearchHandler handler,
                  List<String> keywords)

        Searches the document for the given keywords.

        Specified by:
        search in interface ISearchable
        Parameters:
        options - Options for searching.
        handler - An instance of the search handler.
        keywords - A collection of words to search for.
      • search

        public void search(SearchOptions options,
                  ISearchHandler handler,
                  ISearchEngine searchEngine,
                  List<String> keywords)

        Searches the document for the given keywords using the specified search engine.

        Specified by:
        search in interface ISearchable
        Parameters:
        options - Options for searching.
        handler - An instance of the search handler.
        searchEngine - An instance of the search engine.
        keywords - A collection of words to search for.
      • extractHighlights

        public List<String> extractHighlights(HighlightOptions... highlightOptions)

        Extracts highlights.

        Specified by:
        extractHighlights in interface IHighlightExtractor
        Parameters:
        highlightOptions - A collection of HighlightOptions.
        Returns:
        A collection of strings that represent highlights. If no highlights are found, the collection is empty.
      • extractPage

        public String extractPage(int pageIndex)

        Reads all characters from the page at the given pageIndex and returns them as a string.

        Specified by:
        extractPage in interface IPageTextExtractor
        Parameters:
        pageIndex - The index of the page.
        Returns:
        A string that contains all characters from the page, or null if all characters have been read.
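
        The getPageCount and extractPage entries above suggest a simple page-by-page loop. The following is a minimal sketch of that loop, not the library's own code. PageSource and PageDump are hypothetical stand-ins (they are not part of com.groupdocs.parser) that mirror only the two documented signatures, and the zero-based page index is an assumption that this reference does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the two documented signatures;
// it is not part of com.groupdocs.parser.
interface PageSource {
    int getPageCount();
    String extractPage(int pageIndex);
}

final class PageDump {
    private PageDump() {}

    // Collects the text of every page, assuming zero-based page indexes
    // (an assumption; the reference does not state the index base).
    static List<String> readAllPages(PageSource source) {
        List<String> pages = new ArrayList<>();
        int count = source.getPageCount();
        for (int i = 0; i < count; i++) {
            String text = source.extractPage(i);
            // extractPage is documented as possibly returning null;
            // this sketch stores an empty string in that case.
            pages.add(text == null ? "" : text);
        }
        return pages;
    }
}
```

        To try this against the real class, one option is an anonymous PageSource that forwards both calls to a PdfTextExtractor instance.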
      • reset

        public void reset()

        Resets the current document.

        Resets the cursor's position. After this call, the extractLine method returns the first line of the document.

        Overrides:
        reset in class TextExtractor
      • dispose

        protected void dispose(boolean disposing)

        Releases the unmanaged resources used by the extractor.

        Overrides:
        dispose in class TextExtractor
        Parameters:
        disposing - true if invoked from dispose; otherwise, false.
      • extractTextLine

        protected String extractTextLine()
        Description copied from class: TextExtractor

        Extracts a line of characters from the text extractor and returns the data as a string.

        Overrides:
        extractTextLine in class TextExtractor
        Returns:
        The next line from the extractor, or null if all characters have been extracted.
      • extractText

        protected String extractText()

        Extracts all characters from the current position to the end of the text extractor and returns them as one string.

        Overrides:
        extractText in class TextExtractor
        Returns:
        A string that contains all characters from the current position to the end of the text extractor.
      • prepareLine

        protected String prepareLine()

        Returns a line of the text.

        Specified by:
        prepareLine in class TextExtractor
        Returns:
        A string that represents a line of the text, or null if all characters have been read.
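
        The reset, extractText and prepareLine entries above describe a cursor-style contract: lines are handed out one at a time, null signals the end, and reset returns the cursor to the first line. The following is a minimal sketch of that contract only. LineCursor is a hypothetical stand-in, not the TextExtractor implementation, and joining lines with '\n' is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in illustrating the documented contract;
// it is not the library's implementation.
final class LineCursor {
    private final List<String> lines;
    private int position = 0;

    LineCursor(List<String> lines) {
        this.lines = new ArrayList<>(lines);
    }

    // Returns the next line, or null once every line has been read.
    String prepareLine() {
        return position < lines.size() ? lines.get(position++) : null;
    }

    // Returns everything from the current position to the end as one string.
    // The '\n' separator is an assumption made for this sketch.
    String extractText() {
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = prepareLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    // Moves the cursor back to the first line.
    void reset() {
        position = 0;
    }
}
```

        In this sketch, extractText consumes the remaining lines, so a later prepareLine call returns null until reset is called.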