public class Parser extends Object implements Closeable
Constructor and Description |
---|
Parser(Connection connection) Initializes a new instance of the Parser class to extract data from a database. |
Parser(Connection connection, ParserSettings parserSettings) Initializes a new instance of the Parser class to extract data from a database. |
Parser(EmailConnection connection) Initializes a new instance of the Parser class. |
Parser(EmailConnection connection, ParserSettings parserSettings) Initializes a new instance of the Parser class. |
Parser(InputStream document) Initializes a new instance of the Parser class. |
Parser(InputStream document, LoadOptions loadOptions) Initializes a new instance of the Parser class with LoadOptions. |
Parser(InputStream document, LoadOptions loadOptions, ParserSettings parserSettings) Initializes a new instance of the Parser class with LoadOptions and ParserSettings. |
Parser(String filePath) Initializes a new instance of the Parser class. |
Parser(String filePath, LoadOptions loadOptions) Initializes a new instance of the Parser class with LoadOptions. |
Parser(String filePath, LoadOptions loadOptions, ParserSettings parserSettings) Initializes a new instance of the Parser class with LoadOptions and ParserSettings. |
Modifier and Type | Method and Description |
---|---|
void | close() Closes this resource, relinquishing any underlying resources. |
void | generatePreview(PreviewOptions previewOptions) Generates previews of the document pages. |
Iterable<ContainerItem> | getContainer() Extracts a container object from the document to work with formats that contain attachments, ZIP archives, etc. |
IDocumentInfo | getDocumentInfo() Returns the general information about the document. |
Features | getFeatures() Gets the supported features. |
static FileInfo | getFileInfo(InputStream document) Returns the general information about a file. |
static FileInfo | getFileInfo(String filePath) Returns the general information about a file. |
TextReader | getFormattedText(FormattedTextOptions options) Extracts a formatted text from the document. |
TextReader | getFormattedText(int pageIndex, FormattedTextOptions options) Extracts a formatted text from the document page. |
HighlightItem | getHighlight(int position, boolean isDirect, HighlightOptions options) Extracts a highlight from the document. |
Iterable<PageHyperlinkArea> | getHyperlinks() Extracts hyperlinks from the document. |
Iterable<PageHyperlinkArea> | getHyperlinks(int pageIndex) Extracts hyperlinks from the document page. |
Iterable<PageHyperlinkArea> | getHyperlinks(int pageIndex, PageAreaOptions options) Extracts hyperlinks from the document page using customization options (to set the rectangular area that contains hyperlinks). |
Iterable<PageHyperlinkArea> | getHyperlinks(PageAreaOptions options) Extracts hyperlinks from the document using customization options (to set the rectangular area that contains hyperlinks). |
Iterable<PageImageArea> | getImages() Extracts images from the document. |
Iterable<PageImageArea> | getImages(int pageIndex) Extracts images from the document page. |
Iterable<PageImageArea> | getImages(int pageIndex, PageAreaOptions options) Extracts images from the document page using customization options (to set the rectangular area that contains images). |
Iterable<PageImageArea> | getImages(PageAreaOptions options) Extracts images from the document using customization options (to set the rectangular area that contains images). |
Iterable<MetadataItem> | getMetadata() Extracts metadata from the document. |
Document | getStructure() Extracts a structured text from the document. |
Iterable<PageTableArea> | getTables(int pageIndex, PageTableAreaOptions options) Extracts tables from the document page. |
Iterable<PageTableArea> | getTables(PageTableAreaOptions options) Extracts tables from the document. |
TextReader | getText() Extracts a text from the document. |
TextReader | getText(int pageIndex) Extracts a text from the document page. |
TextReader | getText(int pageIndex, TextOptions options) Extracts a text from the document page using text options (to enable raw fast text extraction mode). |
TextReader | getText(TextOptions options) Extracts a text from the document using text options (to enable raw fast text extraction mode). |
Iterable<PageTextArea> | getTextAreas() Extracts text areas from the document. |
Iterable<PageTextArea> | getTextAreas(int pageIndex) Extracts text areas from the document page. |
Iterable<PageTextArea> | getTextAreas(int pageIndex, PageTextAreaOptions options) Extracts text areas from the document page using customization options (regular expression, match case, etc.). |
Iterable<PageTextArea> | getTextAreas(PageTextAreaOptions options) Extracts text areas from the document using customization options (regular expression, match case, etc.). |
Iterable<TocItem> | getToc() Extracts a table of contents from the document. |
DocumentData | parseByTemplate(Template template) Parses the document by the user-generated template. |
DocumentData | parseForm() Parses the document form. |
Iterable<SearchResult> | search(String keyword) Searches for a keyword in the document. |
Iterable<SearchResult> | search(String keyword, SearchOptions options) Searches for a keyword in the document using search options (regular expression, match case, etc.). |
public Parser(Connection connection)
Initializes a new instance of the Parser class to extract data from a database.
Learn more:
The following example shows how to extract data from an SQLite database:
// Create the database connection
java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
// Create an instance of Parser class to extract tables from the database
try (Parser parser = new Parser(connection)) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Check if toc extraction is supported
if (!parser.getFeatures().isToc()) {
System.out.println("Toc extraction isn't supported.");
return;
}
// Get a list of tables
Iterable<TocItem> toc = parser.getToc();
// Iterate over tables
for(TocItem i : toc)
{
// Print the table name
System.out.println(i.getText());
// Extract a table content as a text
try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
{
System.out.println(reader.readToEnd());
}
}
}
connection - The database connection.
public Parser(Connection connection, ParserSettings parserSettings)
Initializes a new instance of the Parser class to extract data from a database.
Learn more:
The following example shows how to extract data from an SQLite database:
// Create the database connection
java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
// Create an instance of Parser class to extract tables from the database
try (Parser parser = new Parser(connection)) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Check if toc extraction is supported
if (!parser.getFeatures().isToc()) {
System.out.println("Toc extraction isn't supported.");
return;
}
// Get a list of tables
Iterable<TocItem> toc = parser.getToc();
// Iterate over tables
for(TocItem i : toc)
{
// Print the table name
System.out.println(i.getText());
// Extract a table content as a text
try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
{
System.out.println(reader.readToEnd());
}
}
}
connection - The database connection.
parserSettings - The parser settings which are used to customize data extraction.
public Parser(EmailConnection connection)
Initializes a new instance of the Parser class.
Learn more:
The following example shows how to extract emails from Exchange Server:
// Create the connection object for Exchange Web Services protocol
EmailConnection connection = new EmailEwsConnection(
"https://outlook.office365.com/ews/exchange.asmx",
"email@server",
"password");
// Create an instance of Parser class to extract emails from the remote server
try (Parser parser = new Parser(connection)) {
// Check if container extraction is supported
if (!parser.getFeatures().isContainer()) {
System.out.println("Container extraction isn't supported.");
return;
}
// Extract email messages from the server
Iterable<ContainerItem> emails = parser.getContainer();
// Iterate over email messages
for (ContainerItem item : emails) {
// Create an instance of Parser class for email message
try (Parser emailParser = item.openParser()) {
// Extract the email text
try (TextReader reader = emailParser.getText()) {
// Print the email text
System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
}
}
}
}
connection - The email connection.
public Parser(EmailConnection connection, ParserSettings parserSettings)
Initializes a new instance of the Parser class.
Learn more:
The following example shows how to extract emails from Exchange Server:
// Create the connection object for Exchange Web Services protocol
EmailConnection connection = new EmailEwsConnection(
"https://outlook.office365.com/ews/exchange.asmx",
"email@server",
"password");
// Create an instance of Parser class to extract emails from the remote server
try (Parser parser = new Parser(connection)) {
// Check if container extraction is supported
if (!parser.getFeatures().isContainer()) {
System.out.println("Container extraction isn't supported.");
return;
}
// Extract email messages from the server
Iterable<ContainerItem> emails = parser.getContainer();
// Iterate over email messages
for (ContainerItem item : emails) {
// Create an instance of Parser class for email message
try (Parser emailParser = item.openParser()) {
// Extract the email text
try (TextReader reader = emailParser.getText()) {
// Print the email text
System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
}
}
}
}
connection - The email connection.
parserSettings - The parser settings which are used to customize data extraction.
public Parser(String filePath)
Initializes a new instance of the Parser class.
Learn more:
The following example shows how to load the document from the local disk:
// Set the filePath
String filePath = Constants.SamplePdf;
// Create an instance of Parser class with the filePath
try (Parser parser = new Parser(filePath)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
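The examples on this page nest one try-with-resources block inside another (a reader inside a parser). The following minimal sketch shows the order in which such nested resources are closed. It does not use GroupDocs.Parser: the `Resource` class is a hypothetical stand-in that only records when it is closed.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseOrderDemo {
    // Hypothetical stand-in for a closeable resource such as a parser or a reader;
    // it only records the moment it is closed.
    static final class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    // Opens two nested resources and returns the recorded sequence of events.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Resource parser = new Resource("parser", log)) {
            try (Resource reader = new Resource("reader", log)) {
                log.add("using " + reader.name + " from " + parser.name);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // The inner resource (reader) is closed before the outer one (parser)
        run().forEach(System.out::println);
    }
}
```

Because the inner block finishes first, the reader is always released before the parser that produced it, even when an exception is thrown inside the inner block.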
filePath - The path to the file.
public Parser(String filePath, LoadOptions loadOptions)
Initializes a new instance of the Parser class with LoadOptions.
Learn more:
The document password is passed via the LoadOptions class:
try {
String password = "123456";
// Create an instance of Parser class with the password:
try (Parser parser = new Parser(Constants.SamplePassword, new LoadOptions(password))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Print the document text
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
} catch (InvalidPasswordException ex) {
// Print the message if the password is incorrect or empty
System.out.println("Invalid password");
}
filePath - The path to the file.
loadOptions - The options to open the file.
public Parser(String filePath, LoadOptions loadOptions, ParserSettings parserSettings)
Initializes a new instance of the Parser class with LoadOptions and ParserSettings.
Learn more:
The following example shows how to receive the information via the ILogger interface:
try {
// Create an instance of Logger class
Logger logger = new Logger();
// Create an instance of Parser class with the parser settings
try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Print the document text
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
} catch (InvalidPasswordException | IOException ex) {
; // Ignore the exception
}
class Logger implements ILogger {
public void error(String message, Exception exception) {
// Print error message
System.out.println("Error: " + message);
}
public void trace(String message) {
// Print event message
System.out.println("Event: " + message);
}
public void warning(String message) {
// Print warning message
System.out.println("Warning: " + message);
}
}
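The Logger class above prints every message immediately. When messages need to be inspected after parsing, a logger could collect them instead. The sketch below illustrates that idea under stated assumptions: `SimpleLogger` is a hypothetical interface that merely mirrors the three methods implemented above, so the code runs without the library. A real implementation would implement ILogger instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CollectingLoggerDemo {
    // Hypothetical stand-in mirroring the error/trace/warning methods shown above;
    // it is NOT the library's ILogger interface.
    interface SimpleLogger {
        void error(String message, Exception exception);
        void trace(String message);
        void warning(String message);
    }

    // Stores formatted messages in memory instead of printing them.
    static final class CollectingLogger implements SimpleLogger {
        private final List<String> messages = new ArrayList<>();

        @Override
        public void error(String message, Exception exception) {
            // Append the exception type when an exception is supplied
            String suffix = exception == null ? "" : " (" + exception.getClass().getSimpleName() + ")";
            messages.add("Error: " + message + suffix);
        }

        @Override
        public void trace(String message) {
            messages.add("Event: " + message);
        }

        @Override
        public void warning(String message) {
            messages.add("Warning: " + message);
        }

        // Read-only view of everything logged so far
        List<String> getMessages() {
            return Collections.unmodifiableList(messages);
        }
    }

    public static void main(String[] args) {
        CollectingLogger logger = new CollectingLogger();
        logger.trace("opening document");
        logger.warning("font substitution");
        logger.error("cannot read page", new IllegalStateException("boom"));
        logger.getMessages().forEach(System.out::println);
    }
}
```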
filePath - The path to the file.
loadOptions - The options to open the file.
parserSettings - The parser settings which are used to customize data extraction.
public Parser(InputStream document)
Initializes a new instance of the Parser class.
Learn more:
The following example shows how to load the document from the stream:
// Create the stream
try (InputStream stream = new FileInputStream(Constants.SamplePdf)) {
// Create an instance of Parser class with the stream
try (Parser parser = new Parser(stream)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
}
document - The source input stream.
public Parser(InputStream document, LoadOptions loadOptions)
Initializes a new instance of the Parser class with LoadOptions.
Learn more:
In some cases it's necessary to specify the FileFormat explicitly, both for special cases (databases, email servers) and for detecting file types by content:
// Create an instance of Parser class for markdown document
try (Parser parser = new Parser(stream, new LoadOptions(FileFormat.Markup))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
try (TextReader reader = parser.getText()) {
// Print the document text
// Markdown is detected; text without special symbols is printed
System.out.println(reader.readToEnd());
}
}
document - The source input stream.
loadOptions - The options to open the file.
public Parser(InputStream document, LoadOptions loadOptions, ParserSettings parserSettings)
Initializes a new instance of the Parser class with LoadOptions and ParserSettings.
Learn more:
The following example shows how to receive the information via the ILogger interface:
try {
// Create an instance of Logger class
Logger logger = new Logger();
// Create an instance of Parser class with the parser settings
try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Print the document text
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
} catch (InvalidPasswordException | IOException ex) {
; // Ignore the exception
}
class Logger implements ILogger {
public void error(String message, Exception exception) {
// Print error message
System.out.println("Error: " + message);
}
public void trace(String message) {
// Print event message
System.out.println("Event: " + message);
}
public void warning(String message) {
// Print warning message
System.out.println("Warning: " + message);
}
}
document - The source input stream.
loadOptions - The options to open the file.
parserSettings - The parser settings which are used to customize data extraction.
public static FileInfo getFileInfo(String filePath) throws IOException
Returns the general information about a file.
The following code shows how to check whether a file is password-protected:
// Get a file info
FileInfo info = Parser.getFileInfo(filePath);
// Check IsEncrypted property
System.out.println(info.isEncrypted() ? "Password is required" : "");
filePath - The path to the file.
Returns an instance of the FileInfo class.
Throws IOException if an I/O error occurs.
public static FileInfo getFileInfo(InputStream document)
Returns the general information about a file.
The following code shows how to check whether a file is password-protected:
// Get a file info
FileInfo info = Parser.getFileInfo(document);
// Check IsEncrypted property
System.out.println(info.isEncrypted() ? "Password is required" : "");
document - The source input stream.
Returns an instance of the FileInfo class.
public Features getFeatures()
Gets the supported features.
Learn more:
If a feature isn't supported, the corresponding extraction method returns null instead of a value. Some extraction operations may take significant time, so calling an extraction method just to check whether a feature is supported is inefficient. Use getFeatures() for this purpose instead:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleZip)) {
// Check if text extraction is supported for the document
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported");
return;
}
// Extract a text from the document
try (TextReader reader = parser.getText()) {
System.out.println(reader.readToEnd());
}
}
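The reasoning above (check a cheap capability flag first, and run the potentially slow extraction only when it can succeed) can be sketched without the library. `FakeDocument` and its methods are hypothetical stand-ins, not GroupDocs.Parser classes; the call counter exists only to make the effect of the check visible.

```java
final class FeatureCheckDemo {
    // Hypothetical stand-in for a parsed document; NOT a library class.
    static final class FakeDocument {
        private final boolean textSupported;
        private int extractionCalls = 0;

        FakeDocument(boolean textSupported) {
            this.textSupported = textSupported;
        }

        // Cheap capability flag, analogous in spirit to a feature check
        boolean isTextSupported() {
            return textSupported;
        }

        // Potentially expensive operation; returns null when unsupported
        String extractText() {
            extractionCalls++;
            return textSupported ? "document text" : null;
        }

        int getExtractionCalls() {
            return extractionCalls;
        }
    }

    // Consults the flag first, so extraction never runs for unsupported documents.
    static String describe(FakeDocument document) {
        if (!document.isTextSupported()) {
            return "Text extraction isn't supported";
        }
        return document.extractText();
    }

    public static void main(String[] args) {
        FakeDocument unsupported = new FakeDocument(false);
        System.out.println(describe(unsupported));
        System.out.println("Extraction calls: " + unsupported.getExtractionCalls());
    }
}
```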
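Returns an instance of the Features class that represents the supported features.
public void generatePreview(PreviewOptions previewOptions)
Generates previews of the document pages.
previewOptions - The options that set requirements and stream delegates for preview generation.
public IDocumentInfo getDocumentInfo()
Returns the general information about the document.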
Learn more:
The following example shows how to get document info:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Get the document info
IDocumentInfo info = parser.getDocumentInfo();
// Print document information
System.out.println(String.format("FileType: %s", info.getFileType()));
System.out.println(String.format("PageCount: %d", info.getPageCount()));
System.out.println(String.format("Size: %d", info.getSize()));
}
Returns an object that implements the IDocumentInfo interface.
public TextReader getText()
Extracts a text from the document.
Learn more:
The following example shows how to extract a text from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Extract a text into the reader
try (TextReader reader = parser.getText()) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
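Returns an instance of the TextReader class with the extracted text; null if text extraction isn't supported.
public TextReader getText(TextOptions options)
Extracts a text from the document using text options (to enable raw fast text extraction mode).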
Learn more:
The following example shows how to extract a raw text from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Extract a raw text into the reader
try (TextReader reader = parser.getText(new TextOptions(true))) {
// Print a text from the document
// If text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
}
}
options - The text extraction options.
Returns an instance of the TextReader class with the extracted text; null if text extraction isn't supported.
public TextReader getText(int pageIndex)
Extracts a text from the document page.
Learn more:
The following example shows how to extract a text from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Check if the document supports text extraction
if (!parser.getFeatures().isText()) {
System.out.println("Document doesn't support text extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
// Extract a text into the reader
try (TextReader reader = parser.getText(p)) {
// Print a text from the document
// We ignore null-checking as we have checked text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
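pageIndex - The zero-based page index.
Returns an instance of the TextReader class with the extracted text; null if text page extraction isn't supported.
public TextReader getText(int pageIndex, TextOptions options)
Extracts a text from the document page using text options (to enable raw fast text extraction mode).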
Learn more:
The following example shows how to extract a raw text from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Check if the document supports text extraction
if (!parser.getFeatures().isText()) {
System.out.println("Document doesn't support text extraction.");
return;
}
// Get the document info
DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
? (DocumentInfo) parser.getDocumentInfo()
: null;
// Check if the document has pages
if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
// Extract a text into the reader
try (TextReader reader = parser.getText(p, new TextOptions(true))) {
// Print a text from the document
// We ignore null-checking as we have checked text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
pageIndex - The zero-based page index.
options - The text extraction options.
Returns an instance of the TextReader class with the extracted text; null if text page extraction isn't supported.
public TextReader getFormattedText(FormattedTextOptions options)
Extracts a formatted text from the document.
Learn more:
The following example shows how to extract a document text as HTML text:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Extract a formatted text into the reader
try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
// Print a formatted text from the document
// If formatted text extraction isn't supported, a reader is null
System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
}
}
options - The formatted text extraction options.
Returns an instance of the TextReader class with the extracted text; null if formatted text extraction isn't supported.
public TextReader getFormattedText(int pageIndex, FormattedTextOptions options)
Extracts a formatted text from the document page.
Learn more:
The following example shows how to extract a document page text as Markdown text:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Check if the document supports formatted text extraction
if (!parser.getFeatures().isFormattedText()) {
System.out.println("Document doesn't support formatted text extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int p = 0; p < documentInfo.getPageCount(); p++) {
// Print a page number
System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
// Extract a formatted text into the reader
try (TextReader reader = parser.getFormattedText(p, new FormattedTextOptions(FormattedTextMode.Markdown))) {
// Print a formatted text from the document
// We ignore null-checking as we have checked formatted text extraction feature support earlier
System.out.println(reader.readToEnd());
}
}
}
pageIndex - The zero-based page index.
options - The formatted text extraction options.
Returns an instance of the TextReader class with the extracted text; null if formatted text page extraction isn't supported.
public Iterable<SearchResult> search(String keyword)
Searches for a keyword in the document.
Learn more:
The following example shows how to find a keyword in a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search a keyword:
Iterable<SearchResult> sr = parser.search("lorem");
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index and found text:
System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
}
}
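keyword - The keyword to search.
Returns an iterable of SearchResult objects; null if search isn't supported.
public Iterable<SearchResult> search(String keyword, SearchOptions options)
Searches for a keyword in the document using search options (regular expression, match case, etc.).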
Learn more:
The following example shows how to search with a regular expression in a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search with a regular expression with case matching
Iterable<SearchResult> sr = parser.search("[0-9]+", new SearchOptions(true, false, true));
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index and found text:
System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
}
}
The following example shows how to search a text on pages:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Search a keyword with page numbers
Iterable<SearchResult> sr = parser.search("lorem", new SearchOptions(false, false, false, true));
// Check if search is supported
if (sr == null) {
System.out.println("Search isn't supported");
return;
}
// Iterate over search results
for (SearchResult s : sr) {
// Print an index, page number and found text:
System.out.println(String.format("At %d (%d): %s", s.getPosition(), s.getPageIndex(), s.getText()));
}
}
keyword - The keyword to search.
options - The search options.
Returns an iterable of SearchResult objects; null if search isn't supported.
public HighlightItem getHighlight(int position, boolean isDirect, HighlightOptions options)
Extracts a highlight from the document.
Learn more:
The following example shows how to extract a highlight that contains 3 words:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SamplePdf)) {
// Extract a highlight:
HighlightItem hl = parser.getHighlight(2, true, new HighlightOptions(10, 3));
// Check if highlight extraction is supported
if (hl == null) {
System.out.println("Highlight extraction isn't supported");
return;
}
// Print an extracted highlight
System.out.println(String.format("At %d: %s", hl.getPosition(), hl.getText()));
}
position - The start position of the highlight.
isDirect - The value that indicates whether highlight extraction is direct: true if the highlight is extracted to the right of the position; otherwise, false.
options - The highlight extraction options.
Returns an instance of the HighlightItem class that represents the extracted highlight; null if highlight extraction isn't supported.
public Iterable<TocItem> getToc()
Extracts a table of contents from the document.
Learn more:
The following example shows how to extract table of contents from EPUB file:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleEpub)) {
// Check if text extraction is supported
if (!parser.getFeatures().isText()) {
System.out.println("Text extraction isn't supported.");
return;
}
// Check if toc extraction is supported
if (!parser.getFeatures().isToc()) {
System.out.println("Toc extraction isn't supported.");
return;
}
// Get table of contents
Iterable<TocItem> toc = parser.getToc();
// Iterate over items
for (TocItem i : toc) {
// Print the Toc text
System.out.println(i.getText());
// Check if page index has a value
if (i.getPageIndex() == null) {
continue;
}
// Extract a page text
try (TextReader reader = parser.getText(i.getPageIndex())) {
System.out.println(reader.readToEnd());
}
}
}
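The null check on the page index in the example above matters because the index is a nullable Integer, and auto-unboxing a null Integer throws a NullPointerException. The following sketch demonstrates both behaviors with plain Java; the class and method names are illustrative only and have no counterpart in the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NullablePageIndexDemo {
    // Keeps only the page indexes that are present, skipping missing (null) ones.
    static List<Integer> presentPageIndexes(List<Integer> pageIndexes) {
        List<Integer> result = new ArrayList<>();
        for (Integer pageIndex : pageIndexes) {
            if (pageIndex == null) {
                continue; // no page is associated with this item
            }
            int index = pageIndex; // safe: the value is known to be non-null here
            result.add(index);
        }
        return result;
    }

    // Shows what happens when a null Integer is unboxed without a check.
    static boolean unboxingNullThrows() {
        Integer missing = null;
        try {
            int index = missing; // auto-unboxing null throws NullPointerException
            return false;
        } catch (NullPointerException ex) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(presentPageIndexes(Arrays.asList(0, null, 3)));
        System.out.println("Unboxing null throws: " + unboxingNullThrows());
    }
}
```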
Returns an iterable of TocItem objects; null if table of contents extraction isn't supported.
public Iterable<MetadataItem> getMetadata()
Extracts metadata from the document.
Learn more:
The following example shows how to extract metadata from a document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleDocx)) {
// Extract metadata from the document
Iterable<MetadataItem> metadata = parser.getMetadata();
// Check if metadata extraction is supported
if (metadata == null) {
System.out.println("Metadata extraction isn't supported");
return;
}
// Iterate over metadata items
for (MetadataItem item : metadata) {
// Print an item name and value
System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
}
}
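Several methods on this page return null rather than an empty result when a feature isn't supported, and iterating over null in a for-each loop throws a NullPointerException. One possible way to guard against that is a small helper that substitutes an empty iterable. This is a sketch in plain Java; `orEmpty` is a hypothetical helper, not a library method.

```java
import java.util.Arrays;
import java.util.Collections;

final class NullSafeIterationDemo {
    // Returns the given iterable, or an empty one when it is null.
    static <T> Iterable<T> orEmpty(Iterable<T> items) {
        return items == null ? Collections.<T>emptyList() : items;
    }

    // Counts items without failing on a null argument.
    static int count(Iterable<String> items) {
        int n = 0;
        for (String ignored : orEmpty(items)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(null));
        System.out.println(count(Arrays.asList("Author", "Title")));
    }
}
```

Note the trade-off: such a helper hides the difference between "not supported" and "supported but empty", so an explicit null check is preferable when that difference needs to be reported.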
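Returns an iterable of MetadataItem objects; null if metadata extraction isn't supported.
public Iterable<ContainerItem> getContainer()
Extracts a container object from the document to work with formats that contain attachments, ZIP archives, etc.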
Learn more:
The following example shows how to extract a text from zip entities:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleZip)) {
// Extract attachments from the container
Iterable<ContainerItem> attachments = parser.getContainer();
// Check if container extraction is supported
if (attachments == null) {
System.out.println("Container extraction isn't supported");
return;
}
// Iterate over zip entities
for (ContainerItem item : attachments) {
// Print the file path
System.out.println(item.getFilePath());
try {
// Create Parser object for the zip entity content
try (Parser attachmentParser = item.openParser()) {
// Extract a zip entity text
try (TextReader reader = attachmentParser.getText()) {
System.out.println(reader == null ? "No text" : reader.readToEnd());
}
}
} catch (UnsupportedDocumentFormatException ex) {
System.out.println("Isn't supported.");
}
}
}
Returns an iterable of ContainerItem objects; null if container extraction isn't supported.
public Iterable<PageTextArea> getTextAreas()
Extracts text areas from the document.
Learn more:
The following example shows how to extract all text areas from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Extract text areas
Iterable<PageTextArea> areas = parser.getTextAreas();
// Check if text areas extraction is supported
if (areas == null) {
System.out.println("Page text areas extraction isn't supported");
return;
}
// Iterate over page text areas
for (PageTextArea a : areas) {
// Print a page index, rectangle and text area value:
System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
}
}
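Returns an iterable of PageTextArea objects; null if text areas extraction isn't supported.
public Iterable<PageTextArea> getTextAreas(PageTextAreaOptions options)
Extracts text areas from the document using customization options (regular expression, match case, etc.).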
Learn more:
The following example shows how to extract only the text areas from the upper-left corner that match a regular expression:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Create the options which are used for text area extraction
PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));
// Extract text areas from the upper-left corner of a page that match the regular expression:
Iterable<PageTextArea> areas = parser.getTextAreas(options);
// Check if text areas extraction is supported
if (areas == null) {
System.out.println("Page text areas extraction isn't supported");
return;
}
// Iterate over page text areas
for (PageTextArea a : areas) {
// Print a page index, rectangle and text area value:
System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
}
}
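Before passing a regular expression to PageTextAreaOptions, it can help to check what the expression matches. The sketch below runs the expression from the example above through java.util.regex: it matches two lowercase letters surrounded by whitespace and does not match digits. How the library applies the expression to text areas isn't described here, so this only illustrates the pattern itself; the sample strings are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextAreaPatternDemo {
    // The same expression that the example above passes to PageTextAreaOptions
    static final Pattern PATTERN = Pattern.compile("\\s[a-z]{2}\\s");

    // Returns every match found in the text, with surrounding whitespace trimmed.
    static List<String> matches(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PATTERN.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two-letter lowercase words are matched
        System.out.println(matches("item 42 of 100 total"));
        // Digits alone are not matched
        System.out.println(matches("12 34 56"));
    }
}
```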
options - The options for text area extraction.
Returns an iterable of PageTextArea objects; null if text areas extraction isn't supported.
public Iterable<PageTextArea> getTextAreas(int pageIndex)
Extracts text areas from the document page.
Learn more:
To extract text areas from a document page the following method is used:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Check if the document supports text areas extraction
if (!parser.getFeatures().isTextAreas()) {
System.out.println("Document doesn't support text areas extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Iterate over page text areas
// We ignore null-checking as we have checked text areas extraction feature support earlier
for (PageTextArea a : parser.getTextAreas(pageIndex)) {
// Print a rectangle and text area value:
System.out.println(String.format("R: %s, Text: %s", a.getRectangle(), a.getText()));
}
}
}
pageIndex - The zero-based page index.
PageTextArea objects; null if text areas extraction isn't supported.

public Iterable<PageTextArea> getTextAreas(int pageIndex, PageTextAreaOptions options)
pageIndex - The zero-based page index.
options - The options for text area extraction.
PageTextArea objects; null if text areas extraction isn't supported.

public Iterable<PageImageArea> getImages()
The following example shows how to extract all images from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Extract images
Iterable<PageImageArea> images = parser.getImages();
// Check if images extraction is supported
if (images == null) {
System.out.println("Images extraction isn't supported");
return;
}
// Iterate over images
for (PageImageArea image : images) {
// Print a page index, rectangle and image type:
System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
}
}
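Several methods in this class return null when a feature isn't supported, so every caller repeats the same null check before iterating. A minimal sketch of one way to centralize that check is shown below. The helper NullSafeIteration.orEmpty is hypothetical and is not part of the Parser API; it uses only the standard library.

```java
import java.util.Arrays;
import java.util.Collections;

class NullSafeIteration {
    // Hypothetical helper: returns the given iterable, or an empty one when it is null
    static <T> Iterable<T> orEmpty(Iterable<T> items) {
        return items != null ? items : Collections.<T>emptyList();
    }

    // Counts the items, treating a null result as "no items"
    static <T> int count(Iterable<T> items) {
        int n = 0;
        for (T ignored : orEmpty(items)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterable<String> unsupported = null; // stands in for a null result from an extraction method
        System.out.println(count(unsupported));
        System.out.println(count(Arrays.asList("a", "b")));
    }
}
```

The trade-off is that this hides the difference between "not supported" and "supported, but nothing found". When that difference matters, the explicit null check used in the example above is the safer choice.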
PageImageArea objects; null if images extraction isn't supported.

public Iterable<PageImageArea> getImages(PageAreaOptions options)
The following example shows how to extract only the images located in a rectangular area of a page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Create the options which are used for images extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(340, 150), new Size(300, 100)));
// Extract images from the specified rectangular area of a page:
Iterable<PageImageArea> images = parser.getImages(options);
// Check if images extraction is supported
if (images == null) {
System.out.println("Page images extraction isn't supported");
return;
}
// Iterate over images
for (PageImageArea image : images) {
// Print a page index, rectangle and image type:
System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
}
}
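To see which part of the page the rectangle in the example above covers, the arithmetic can be worked through with a small stand-in type. The sketch below is not the library's Rectangle class; AreaBounds is a hypothetical name. Whether the parser selects images that are fully contained in the area or merely intersect it is not stated here, so both checks are shown.

```java
class AreaBounds {
    final double left;
    final double top;
    final double width;
    final double height;

    AreaBounds(double left, double top, double width, double height) {
        this.left = left;
        this.top = top;
        this.width = width;
        this.height = height;
    }

    double right() { return left + width; }

    double bottom() { return top + height; }

    // True when the other rectangle lies entirely inside this one
    boolean contains(AreaBounds other) {
        return other.left >= left && other.top >= top
                && other.right() <= right() && other.bottom() <= bottom();
    }

    // True when the two rectangles share any interior area
    boolean intersects(AreaBounds other) {
        return other.left < right() && other.right() > left
                && other.top < bottom() && other.bottom() > top;
    }

    public static void main(String[] args) {
        // Same numbers as the example above: Point(340, 150), Size(300, 100)
        AreaBounds area = new AreaBounds(340, 150, 300, 100);
        System.out.println("x: " + area.left + ".." + area.right());
        System.out.println("y: " + area.top + ".." + area.bottom());
        System.out.println(area.contains(new AreaBounds(350, 160, 100, 50)));
        System.out.println(area.intersects(new AreaBounds(600, 200, 100, 100)));
        System.out.println(area.intersects(new AreaBounds(10, 10, 50, 50)));
    }
}
```

With these numbers the area spans x from 340 to 640 and y from 150 to 250, so a rectangle that starts at the page origin, such as the last one in main, falls outside it.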
options - The options for images extraction.
PageImageArea objects; null if images extraction isn't supported.

public Iterable<PageImageArea> getImages(int pageIndex)
The following example shows how to extract images from a document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
// Check if the document supports images extraction
if (!parser.getFeatures().isImages()) {
System.out.println("Document doesn't support images extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Iterate over images
// We ignore null-checking as we have checked images extraction feature support earlier
for (PageImageArea image : parser.getImages(pageIndex)) {
// Print a rectangle and image type
System.out.println(String.format("R: %s, Type: %s", image.getRectangle(), image.getFileType()));
}
}
}
pageIndex - The zero-based page index.
PageImageArea objects; null if images extraction isn't supported.

public Iterable<PageImageArea> getImages(int pageIndex, PageAreaOptions options)
pageIndex - The zero-based page index.
options - The options for images extraction.
PageImageArea objects; null if images extraction isn't supported.

public Iterable<PageHyperlinkArea> getHyperlinks()
The following example shows how to extract all hyperlinks from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Extract hyperlinks from the document
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks();
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.

public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex)
The following example shows how to extract hyperlinks from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Extract hyperlinks from the document page
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
}
pageIndex - The zero-based page index.
PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.

public Iterable<PageHyperlinkArea> getHyperlinks(PageAreaOptions options)
The following example shows how to extract hyperlinks from the document page area:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.HyperlinksPdf)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Create the options which are used for hyperlink extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
// Extract hyperlinks from the document page area
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(options);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
options - The options for hyperlinks extraction.
PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.

public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex, PageAreaOptions options)
The following example shows how to extract hyperlinks from the document page area using customization options:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports hyperlink extraction
if (!parser.getFeatures().isHyperlinks()) {
System.out.println("Document doesn't support hyperlink extraction.");
return;
}
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Create the options which are used for hyperlink extraction
PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Extract hyperlinks from the document page
Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex, options);
// Iterate over hyperlinks
for (PageHyperlinkArea h : hyperlinks) {
// Print the hyperlink text
System.out.println(h.getText());
// Print the hyperlink URL
System.out.println(h.getUrl());
System.out.println();
}
}
}
pageIndex - The zero-based page index.
options - The options for hyperlinks extraction.
PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.

public Iterable<PageTableArea> getTables(PageTableAreaOptions options)
The following example shows how to extract tables from the whole document:
// Create an instance of Parser class
try (Parser parser = new Parser(filePath)) {
// Check if the document supports table extraction
if (!parser.getFeatures().isTables()) {
System.out.println("Document doesn't support tables extraction.");
return;
}
// Create the layout of tables
TemplateTableLayout layout = new TemplateTableLayout(
java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
// Create the options for table extraction
PageTableAreaOptions options = new PageTableAreaOptions(layout);
// Extract tables from the document
Iterable<PageTableArea> tables = parser.getTables(options);
// Iterate over tables
for (PageTableArea t : tables) {
// Iterate over rows
for (int row = 0; row < t.getRowCount(); row++) {
// Iterate over columns
for (int column = 0; column < t.getColumnCount(); column++) {
// Get the table cell
PageTableAreaCell cell = t.getCell(row, column);
if (cell != null) {
// Print the table cell text
System.out.print(cell.getText());
System.out.print(" | ");
}
}
System.out.println();
}
System.out.println();
}
}
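The two lists passed to TemplateTableLayout in the example above are separator coordinates. The sketch below works through the resulting grid under the assumption that each adjacent pair of separators bounds one column or one row; that reading is not spelled out here, so treat it as an illustration only. SeparatorGrid and its methods are hypothetical names built on the standard library.

```java
import java.util.Arrays;
import java.util.List;

class SeparatorGrid {
    // Number of columns (or rows), assuming each adjacent pair of separators bounds one of them
    static int cellCount(List<Double> separators) {
        return Math.max(0, separators.size() - 1);
    }

    // Index of the column (or row) that contains the coordinate, or -1 when it lies outside the grid
    static int indexOf(List<Double> separators, double coordinate) {
        for (int i = 0; i + 1 < separators.size(); i++) {
            if (coordinate >= separators.get(i) && coordinate < separators.get(i + 1)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same separator values as in the example above
        List<Double> vertical = Arrays.asList(50.0, 95.0, 275.0, 415.0, 485.0, 545.0);
        List<Double> horizontal = Arrays.asList(325.0, 340.0, 365.0, 395.0);
        System.out.println("Columns: " + cellCount(vertical));
        System.out.println("Rows: " + cellCount(horizontal));
        System.out.println("x = 300 falls in column " + indexOf(vertical, 300.0));
        System.out.println("y = 350 falls in row " + indexOf(horizontal, 350.0));
    }
}
```

Under that assumption, six vertical separators and four horizontal separators describe a grid of five columns and three rows.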
options - The options for tables extraction.
PageTableArea objects; null if tables extraction isn't supported.

public Iterable<PageTableArea> getTables(int pageIndex, PageTableAreaOptions options)
The following example shows how to extract tables from the document page:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleInvoicePagesPdf)) {
// Check if the document supports table extraction
if (!parser.getFeatures().isTables()) {
System.out.println("Document doesn't support tables extraction.");
return;
}
// Create the layout of tables
TemplateTableLayout layout = new TemplateTableLayout(
java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
// Create the options for table extraction
PageTableAreaOptions options = new PageTableAreaOptions(layout);
// Get the document info
IDocumentInfo documentInfo = parser.getDocumentInfo();
// Check if the document has pages
if (documentInfo.getPageCount() == 0) {
System.out.println("Document has no pages.");
return;
}
// Iterate over pages
for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
// Print a page number
System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
// Extract tables from the document page
Iterable<PageTableArea> tables = parser.getTables(pageIndex, options);
// Iterate over tables
for (PageTableArea t : tables) {
// Iterate over rows
for (int row = 0; row < t.getRowCount(); row++) {
// Iterate over columns
for (int column = 0; column < t.getColumnCount(); column++) {
// Get the table cell
PageTableAreaCell cell = t.getCell(row, column);
if (cell != null) {
// Print the table cell text
System.out.print(cell.getText());
System.out.print(" | ");
}
}
System.out.println();
}
System.out.println();
}
}
}
pageIndex - The zero-based page index.
options - The options for tables extraction.
PageTableArea objects; null if tables extraction isn't supported.

public DocumentData parseByTemplate(Template template)
template - The user-generated template.
DocumentData class that contains the extracted data; null if parsing by template isn't supported.

public DocumentData parseForm()
The following example shows how to parse a form of the document:
// Create an instance of Parser class
try (Parser parser = new Parser(Constants.SampleFormsPdf)) {
// Extract data from PDF document
DocumentData data = parser.parseForm();
// Check if form extraction is supported
if (data == null) {
System.out.println("Form extraction isn't supported.");
return;
}
// Iterate over extracted data
for (int i = 0; i < data.getCount(); i++) {
System.out.print(data.get(i).getName() + ": ");
PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
? (PageTextArea) data.get(i).getPageArea()
: null;
System.out.println(area == null ? "Not a template field" : area.getText());
}
}
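The example above evaluates data.get(i).getPageArea() twice in order to cast it safely. One possible way to express the same "cast or null" step once is a small generic helper, sketched below with the standard library only. SafeCast.asOrNull is a hypothetical name, not part of the Parser API.

```java
class SafeCast {
    // Returns the value cast to the requested type, or null when it is null or of another type
    static <T> T asOrNull(Object value, Class<T> type) {
        return type.isInstance(value) ? type.cast(value) : null;
    }

    public static void main(String[] args) {
        Object area = "sample text"; // stands in for a page area of unknown concrete type
        String asText = asOrNull(area, String.class);
        Integer asNumber = asOrNull(area, Integer.class);
        System.out.println(asText == null ? "Not a text value" : asText);
        System.out.println(asNumber == null ? "Not a number" : asNumber.toString());
    }
}
```

With such a helper, the lookup in the loop would be evaluated once and its result passed in, instead of being repeated in both branches of the conditional expression.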
DocumentData class that contains the extracted data; null if form extraction isn't supported.

public Document getStructure()
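Because the result is a standard org.w3c.dom.Document, it can be traversed with the regular DOM API. The sketch below builds a small sample document from an XML string and lists its element names in document order. The sample XML and its tag names are hypothetical and are not taken from the library's actual output; StructureWalker is likewise a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class StructureWalker {
    // Collects the names of all element nodes in document order (depth-first)
    static List<String> elementNames(Node root) {
        List<String> names = new ArrayList<>();
        collect(root, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), names);
        }
    }

    // Parses XML text into a DOM document; used here only to build sample input
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Sample XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical structure; the real element names may differ
        Document doc = parse("<document><section><p>Hello</p><p>World</p></section></document>");
        System.out.println(elementNames(doc));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

The same elementNames method accepts any Document, so it could be applied to a non-null result of getStructure() to discover which elements a given document produces.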
org.w3c.dom.Document class with XML text structure; null if text structure extraction isn't supported.

public void close()
close in interface Closeable
close in interface AutoCloseable
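The examples in this section create the Parser inside a try-with-resources statement and leave the block with return when a feature isn't supported. Since close() is specified by AutoCloseable, the Java language closes the resource on every exit path. The sketch below demonstrates this with a stand-in resource; TrackedResource and CloseDemo are hypothetical and only record the order of events.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

class CloseDemo {
    // Hypothetical stand-in for any Closeable resource; records lifecycle events
    static class TrackedResource implements Closeable {
        private final List<String> log;

        TrackedResource(List<String> log) {
            this.log = log;
            log.add("open");
        }

        @Override
        public void close() {
            log.add("close");
        }
    }

    // Returns the recorded events for a normal run or for an early return from the try block
    static List<String> run(boolean returnEarly) {
        List<String> log = new ArrayList<>();
        try (TrackedResource resource = new TrackedResource(log)) {
            if (returnEarly) {
                log.add("early return");
                return log; // close() still runs before the method returns
            }
            log.add("work");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

In both runs "close" is the last recorded event, which is why the early return statements in the examples above do not leak the resource.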