public final class PdfTextExtractor extends TextExtractor implements IPageTextExtractor, IContainer, ISearchable, IHighlightExtractor, IRegexSearchable, IDocumentContentExtractor, IFastTextExtractor
Provides a text extractor for PDF documents.
Extracting text from a PDF:
// Create a text extractor for a PDF (stream is an open InputStream on the document)
PdfTextExtractor extractor = new PdfTextExtractor(stream);
// Extract all of the text
System.out.println(extractor.extractAll());
Extracting text page by page:
// Create a text extractor for a PDF (stream is an open InputStream on the document)
PdfTextExtractor extractor = new PdfTextExtractor(stream);
// Iterate over the pages
for (int pageIndex = 0; pageIndex < extractor.getPageCount(); pageIndex++) {
    // Extract the text of the page whose index is pageIndex
    System.out.println(extractor.extractPage(pageIndex));
}
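When the per-page text is needed for further processing rather than printing, the same loop can be wrapped in a helper that returns one string per page. The following is a minimal sketch, not part of the library: `PageTextSource` and `PageCollector` are hypothetical names, and `PageTextSource` mirrors only the `getPageCount()` and `extractPage(int)` methods documented on this page so that the snippet compiles without the library on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two page-related methods documented on this page.
// It is NOT a library type; it exists only so that this sketch compiles on its own.
interface PageTextSource {
    int getPageCount();
    String extractPage(int pageIndex);
}

final class PageCollector {
    private PageCollector() {}

    // Returns the text of every page, in page order (index 0 first).
    static List<String> collectPages(PageTextSource source) {
        int pageCount = source.getPageCount(); // read once instead of on every iteration
        List<String> pages = new ArrayList<>(pageCount);
        for (int pageIndex = 0; pageIndex < pageCount; pageIndex++) {
            pages.add(source.extractPage(pageIndex));
        }
        return pages;
    }
}
```

In real code the loop body would presumably call `extractor.getPageCount()` and `extractor.extractPage(pageIndex)` on a `PdfTextExtractor` directly, or through a small adapter that implements the stand-in interface.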
Constructor and Description |
---|
PdfTextExtractor(InputStream stream): Initializes a new instance of the PdfTextExtractor class from a stream. |
PdfTextExtractor(InputStream stream, LoadOptions loadOptions): Initializes a new instance of the PdfTextExtractor class from a stream, using the given load options. |
PdfTextExtractor(String fileName): Initializes a new instance of the PdfTextExtractor class from a file path. |
PdfTextExtractor(String fileName, LoadOptions loadOptions): Initializes a new instance of the PdfTextExtractor class from a file path, using the given load options. |
Modifier and Type | Method and Description |
---|---|
protected void | dispose(boolean disposing): Releases the unmanaged resources used by the extractor. |
List<String> | extractHighlights(HighlightOptions... highlightOptions): Extracts highlights. |
String | extractPage(int pageIndex): Reads all characters from the page with the index pageIndex and returns them as a string. |
protected String | extractText(): Extracts all characters from the current position to the end of the text extractor and returns them as one string. |
protected String | extractTextLine(): Extracts a line of characters from the text extractor and returns it as a string. |
DocumentContent | getDocumentContent(): Gets access to the document's content. |
List<Container.Entity> | getEntities(): Gets a collection of the container's entities. |
int | getExtractMode(): Gets a value indicating the mode of text extraction. |
Dictionary<String,String> | getFormData(): Extracts PDF form data from the document. |
int | getPageCount(): Gets the total number of pages. |
TableAreaDetector | getTableAreaDetector(): Gets a table detector. |
TableAreaParser | getTableAreaParser(): Gets a table parser. |
InputStream | openEntityStream(Container.Entity entity): Opens a stream with the content of the container's entity. |
protected String | prepareLine(): Returns a line of the text. |
void | reset(): Resets the current document. |
void | search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords): Searches for the keywords using the given search engine. |
void | search(SearchOptions options, ISearchHandler handler, List<String> keywords): Searches for the keywords. |
void | searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions): Searches with a regular expression. |
void | setExtractMode(int value): Sets a value indicating the mode of text extraction. |
Methods inherited from class TextExtractor:
checkDisposed, close, dispose, extractAll, extractLine, getEncoding, getMediaType, getPassword, isDisposed, setEncoding, setMediaType
public PdfTextExtractor(String fileName)
Initializes a new instance of the PdfTextExtractor class.
Parameters:
fileName - The path to the file.
public PdfTextExtractor(String fileName, LoadOptions loadOptions)
Initializes a new instance of the PdfTextExtractor class.
Parameters:
fileName - The path to the file.
loadOptions - The options for loading the file.
public PdfTextExtractor(InputStream stream)
Initializes a new instance of the PdfTextExtractor class.
Parameters:
stream - The stream of the document.
public PdfTextExtractor(InputStream stream, LoadOptions loadOptions)
Initializes a new instance of the PdfTextExtractor class.
Parameters:
stream - The stream of the document.
loadOptions - The options for loading the file.
public DocumentContent getDocumentContent()
Gets access to the document's content.
getDocumentContent in interface IDocumentContentExtractor
Returns:
An instance of the DocumentContent class.
public TableAreaDetector getTableAreaDetector()
Gets a table detector.
Returns:
An instance of TableAreaDetector.
public TableAreaParser getTableAreaParser()
Gets a table parser.
Returns:
An instance of TableAreaParser.
public int getExtractMode()
Gets a value indicating the mode of text extraction.
getExtractMode in interface IFastTextExtractor
Returns:
The mode of text extraction. The default is Standard.
public void setExtractMode(int value)
Sets a value indicating the mode of text extraction.
setExtractMode in interface IFastTextExtractor
Parameters:
value - The mode of text extraction. The default is Standard.
public int getPageCount()
Gets the total number of pages.
getPageCount in interface IPageTextExtractor
public List<Container.Entity> getEntities()
Gets a collection of the container's entities.
getEntities in interface IContainer
public Dictionary<String,String> getFormData()
Extracts PDF form data from the document.
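Assuming the return type is `java.util.Dictionary` (this page does not state the package, so that is an assumption), the form data cannot be used in a for-each loop directly, because `Dictionary` exposes its keys as an `Enumeration`. The sketch below copies such a dictionary into a sorted `Map`. `FormDataMaps` and `toSortedMap` are illustrative names, and a `Hashtable` with made-up field names stands in for the value returned by `getFormData()`.

```java
import java.util.Collections;
import java.util.Dictionary;
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

final class FormDataMaps {
    private FormDataMaps() {}

    // Copies a Dictionary into a TreeMap so that the fields can be iterated in key order.
    static Map<String, String> toSortedMap(Dictionary<String, String> formData) {
        Map<String, String> result = new TreeMap<>();
        // Collections.list turns the key Enumeration into a List usable in for-each.
        for (String key : Collections.list(formData.keys())) {
            result.put(key, formData.get(key));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data; with the library this would come from extractor.getFormData().
        Dictionary<String, String> formData = new Hashtable<>();
        formData.put("lastName", "Doe");
        formData.put("firstName", "Jane");
        toSortedMap(formData)
                .forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```

A `TreeMap` is used here only to make the iteration order predictable; a `HashMap` would do if the order of the fields does not matter.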
public InputStream openEntityStream(Container.Entity entity)
Opens a stream with the content of the container's entity.
openEntityStream in interface IContainer
Parameters:
entity - A container's entity.
public void search(SearchOptions options, ISearchHandler handler, List<String> keywords)
Searches for the keywords.
search in interface ISearchable
Parameters:
options - Options for searching.
handler - An instance of the search handler.
keywords - A collection of words to search for.
public void search(SearchOptions options, ISearchHandler handler, ISearchEngine searchEngine, List<String> keywords)
Searches for the keywords using the given search engine.
search in interface ISearchable
Parameters:
options - Options for searching.
handler - An instance of the search handler.
searchEngine - An instance of the search engine.
keywords - A collection of words to search for.
public void searchWithRegex(String expression, ISearchHandler handler, RegexSearchOptions searchOptions)
Searches with a regular expression.
searchWithRegex in interface IRegexSearchable
Parameters:
expression - A regular expression.
handler - An instance of the search handler.
searchOptions - Options for searching.
public List<String> extractHighlights(HighlightOptions... highlightOptions)
Extracts highlights.
extractHighlights in interface IHighlightExtractor
Parameters:
highlightOptions - A collection of HighlightOptions.
public String extractPage(int pageIndex)
Reads all characters from the page with the index pageIndex and returns them as a string.
extractPage in interface IPageTextExtractor
Parameters:
pageIndex - The index of the page.
public void reset()
Resets the current document. After a reset, the extractLine method returns the first line of the document.
reset in class TextExtractor
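The effect of `reset()` on line-by-line reading can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions, not library code: `LineSource`, `InMemoryLines`, and `ResetDemo` are hypothetical types that only mimic the behaviour described above, and returning `null` once no lines remain is an assumption of the sketch that this page does not confirm.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the extractLine()/reset() pair; NOT a library type.
interface LineSource {
    String extractLine(); // assumption: returns null when no lines remain
    void reset();
}

// Serves lines from memory and remembers the current reading position.
final class InMemoryLines implements LineSource {
    private final List<String> lines;
    private int position = 0;

    InMemoryLines(String... lines) {
        this.lines = Arrays.asList(lines);
    }

    @Override
    public String extractLine() {
        return position < lines.size() ? lines.get(position++) : null;
    }

    @Override
    public void reset() {
        position = 0; // the next extractLine() call returns the first line again
    }
}

final class ResetDemo {
    private ResetDemo() {}

    // Consumes the source from its current position and counts the remaining lines.
    static int countLines(LineSource source) {
        int count = 0;
        while (source.extractLine() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        LineSource source = new InMemoryLines("first", "second", "third");
        int total = countLines(source); // first pass reads to the end
        source.reset();                 // rewind before the second pass
        System.out.println(total + " lines; first line is: " + source.extractLine());
    }
}
```

The point of the pattern is that a second pass over the same document needs an explicit `reset()` between passes; without it, the second pass would start where the first one stopped.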
protected void dispose(boolean disposing)
Releases the unmanaged resources used by the extractor.
dispose in class TextExtractor
Parameters:
disposing - true if invoked from dispose; otherwise, false.
protected String extractTextLine()
Description copied from class: TextExtractor
Extracts a line of characters from the text extractor and returns it as a string.
extractTextLine in class TextExtractor
protected String extractText()
Extracts all characters from the current position to the end of the text extractor and returns them as one string.
extractText in class TextExtractor
protected String prepareLine()
Returns a line of the text.
prepareLine in class TextExtractor