com.groupdocs.parser

Class Parser

  • All Implemented Interfaces:
    Closeable, AutoCloseable


    public class Parser
    extends Object
    implements Closeable
    Represents the main class that controls text, image, and container extraction as well as parsing functionality.
    • Constructor Detail

      • Parser

        public Parser(Connection connection)
        Initializes a new instance of the Parser class to extract data from a database.

        The following example shows how to extract data from an SQLite database:

        // Create a JDBC connection object
         java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
         // Create an instance of Parser class to extract tables from the database
         try (Parser parser = new Parser(connection)) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             // Check if toc extraction is supported
             if (!parser.getFeatures().isToc()) {
                 System.out.println("Toc extraction isn't supported.");
                 return;
             }
             // Get a list of tables
             Iterable<TocItem> toc = parser.getToc();
             // Iterate over tables
             for(TocItem i : toc)
             {
                 // Print the table name
                 System.out.println(i.getText());
                 // Extract a table content as a text
                 try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
                 {
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        connection - The database connection.
      • Parser

        public Parser(Connection connection,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class to extract data from a database.

        The following example shows how to extract data from an SQLite database:

        // Create a JDBC connection object
         java.sql.Connection connection = java.sql.DriverManager.getConnection(String.format("jdbc:sqlite:%s", Constants.SampleDatabase));
         // Create an instance of Parser class to extract tables from the database
         try (Parser parser = new Parser(connection)) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             // Check if toc extraction is supported
             if (!parser.getFeatures().isToc()) {
                 System.out.println("Toc extraction isn't supported.");
                 return;
             }
             // Get a list of tables
             Iterable<TocItem> toc = parser.getToc();
             // Iterate over tables
             for(TocItem i : toc)
             {
                 // Print the table name
                 System.out.println(i.getText());
                 // Extract a table content as a text
                 try(TextReader reader = parser.getText(i.getPageIndex().intValue()))
                 {
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        connection - The database connection.
        parserSettings - The parser settings which are used to customize data extraction.
      • Parser

        public Parser(EmailConnection connection)
        Initializes a new instance of the Parser class.

        The following example shows how to extract emails from Exchange Server:

        // Create the connection object for Exchange Web Services protocol
         EmailConnection connection = new EmailEwsConnection(
                 "https://outlook.office365.com/ews/exchange.asmx",
                 "email@server",
                 "password");
         // Create an instance of Parser class to extract emails from the remote server
         try (Parser parser = new Parser(connection)) {
             // Check if container extraction is supported
             if (!parser.getFeatures().isContainer()) {
                 System.out.println("Container extraction isn't supported.");
                 return;
             }
             // Extract email messages from the server
             Iterable<ContainerItem> emails = parser.getContainer();
             // Iterate over email messages
             for (ContainerItem item : emails) {
                 // Create an instance of Parser class for email message
                 try (Parser emailParser = item.openParser()) {
                     // Extract the email text
                     try (TextReader reader = emailParser.getText()) {
                         // Print the email text
                         System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
                     }
                 }
             }
         }
         
        Parameters:
        connection - The email connection.
      • Parser

        public Parser(EmailConnection connection,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class.

        The following example shows how to extract emails from Exchange Server:

        // Create the connection object for Exchange Web Services protocol
         EmailConnection connection = new EmailEwsConnection(
                 "https://outlook.office365.com/ews/exchange.asmx",
                 "email@server",
                 "password");
         // Create an instance of Parser class to extract emails from the remote server
         try (Parser parser = new Parser(connection)) {
             // Check if container extraction is supported
             if (!parser.getFeatures().isContainer()) {
                 System.out.println("Container extraction isn't supported.");
                 return;
             }
             // Extract email messages from the server
             Iterable<ContainerItem> emails = parser.getContainer();
             // Iterate over email messages
             for (ContainerItem item : emails) {
                 // Create an instance of Parser class for email message
                 try (Parser emailParser = item.openParser()) {
                     // Extract the email text
                     try (TextReader reader = emailParser.getText()) {
                         // Print the email text
                         System.out.println(reader == null ? "Text extraction isn't supported." : reader.readToEnd());
                     }
                 }
             }
         }
         
        Parameters:
        connection - The email connection.
        parserSettings - The parser settings which are used to customize data extraction.
      • Parser

        public Parser(String filePath)
        Initializes a new instance of the Parser class.

        The following example shows how to load the document from the local disk:

        // Set the filePath
         String filePath = Constants.SamplePdf;
         // Create an instance of Parser class with the filePath
         try (Parser parser = new Parser(filePath)) {
             // Extract a text into the reader
             try (TextReader reader = parser.getText()) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        filePath - The path to the file.
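        Because Parser implements Closeable, the examples wrap it in try-with-resources and often nest a TextReader inside. The sketch below uses a hypothetical stand-in class, Tracked (not part of GroupDocs.Parser), to illustrate the standard Java behavior that nested resources are closed innermost first.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Hypothetical stand-in for Parser/TextReader: records when it is used and closed.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void use() {
            log.add("use " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Mirrors the nesting used in the examples: outer "parser", inner "reader".
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked parser = new Tracked("parser", log)) {
            parser.use();
            try (Tracked reader = new Tracked("reader", log)) {
                reader.use();
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

        The inner resource is closed before the outer one, so a reader obtained from a parser is released while the parser is still open.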
      • Parser

        public Parser(String filePath,
              LoadOptions loadOptions)
        Initializes a new instance of the Parser class with LoadOptions.

        The document password is passed via the LoadOptions class:

        try {
             String password = "123456";
             // Create an instance of Parser class with the password:
             try (Parser parser = new Parser(Constants.SamplePassword, new LoadOptions(password))) {
                 // Check if text extraction is supported
                 if (!parser.getFeatures().isText()) {
                     System.out.println("Text extraction isn't supported.");
                     return;
                 }
                 // Print the document text
                 try (TextReader reader = parser.getText()) {
                     System.out.println(reader.readToEnd());
                 }
             }
         } catch (InvalidPasswordException ex) {
             // Print the message if the password is incorrect or empty
             System.out.println("Invalid password");
         }
         
        Parameters:
        filePath - The path to the file.
        loadOptions - The options to open the file.
      • Parser

        public Parser(String filePath,
              LoadOptions loadOptions,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

        The following example shows how to receive the information via the ILogger interface:

        try {
             // Create an instance of Logger class
             Logger logger = new Logger();
             // Create an instance of Parser class with the parser settings
             try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
                 // Check if text extraction is supported
                 if (!parser.getFeatures().isText()) {
                     System.out.println("Text extraction isn't supported.");
                     return;
                 }
                 // Print the document text
                 try (TextReader reader = parser.getText()) {
                     System.out.println(reader.readToEnd());
                 }
             }
         } catch (InvalidPasswordException | IOException ex) {
             ; // Ignore the exception
         }
        
         class Logger implements ILogger {
             public void error(String message, Exception exception) {
                 // Print error message
                 System.out.println("Error: " + message);
             }
        
             public void trace(String message) {
                 // Print event message
                 System.out.println("Event: " + message);
             }
        
             public void warning(String message) {
                 // Print warning message
                 System.out.println("Warning: " + message);
             }
         }
         
        Parameters:
        filePath - The path to the file.
        loadOptions - The options to open the file.
        parserSettings - The parser settings which are used to customize data extraction.
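        The Logger in the example prints each message as it arrives. As a variation, a logger can collect messages in memory so they can be inspected after parsing. The sketch below is self-contained and therefore declares a hypothetical stand-in interface, SketchLogger, that mirrors the three methods implemented by the example's Logger; in real use the class would presumably implement the library's ILogger instead. The log messages are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CollectingLoggerSketch {
    // Hypothetical stand-in that mirrors the methods of the example's Logger class.
    interface SketchLogger {
        void error(String message, Exception exception);
        void trace(String message);
        void warning(String message);
    }

    // Stores every message instead of printing it.
    static final class CollectingLogger implements SketchLogger {
        private final List<String> entries = new ArrayList<>();

        @Override
        public void error(String message, Exception exception) {
            String cause = exception == null ? "" : " (" + exception.getClass().getSimpleName() + ")";
            entries.add("Error: " + message + cause);
        }

        @Override
        public void trace(String message) {
            entries.add("Event: " + message);
        }

        @Override
        public void warning(String message) {
            entries.add("Warning: " + message);
        }

        // Read-only view of the collected messages.
        List<String> entries() {
            return Collections.unmodifiableList(entries);
        }
    }

    static List<String> demo() {
        CollectingLogger logger = new CollectingLogger();
        logger.trace("opening document");
        logger.warning("font substituted");
        logger.error("page skipped", new IllegalStateException("bad page"));
        return logger.entries();
    }

    public static void main(String[] args) {
        for (String entry : demo()) {
            System.out.println(entry);
        }
    }
}
```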
      • Parser

        public Parser(InputStream document)
        Initializes a new instance of the Parser class.

        The following example shows how to load the document from the stream:

        // Create the stream
         try (InputStream stream = new FileInputStream(Constants.SamplePdf)) {
             // Create an instance of Parser class with the stream
             try (Parser parser = new Parser(stream)) {
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText()) {
                     // Print a text from the document
                     // If text extraction isn't supported, a reader is null
                     System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        document - The source input stream.
      • Parser

        public Parser(InputStream document,
              LoadOptions loadOptions)
        Initializes a new instance of the Parser class with LoadOptions.

        In some cases it's necessary to specify the FileFormat explicitly, both for special cases (databases, email servers) and for files whose type would otherwise have to be detected from their content:

        // Create an instance of Parser class for markdown document
         try (Parser parser = new Parser(stream, new LoadOptions(FileFormat.Markup))) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             try (TextReader reader = parser.getText()) {
                 // Print the document text
                 // Markdown is detected; text without special symbols is printed
                 System.out.println(reader.readToEnd());
             }
         }
         
        Parameters:
        document - The source input stream.
        loadOptions - The options to open the file.
      • Parser

        public Parser(InputStream document,
              LoadOptions loadOptions,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class with LoadOptions and ParserSettings.

        The following example shows how to receive the information via the ILogger interface:

        try {
             // Create an instance of Logger class
             Logger logger = new Logger();
             // Create an instance of Parser class with the parser settings
             try (Parser parser = new Parser(Constants.SamplePassword, null, new ParserSettings(logger))) {
                 // Check if text extraction is supported
                 if (!parser.getFeatures().isText()) {
                     System.out.println("Text extraction isn't supported.");
                     return;
                 }
                 // Print the document text
                 try (TextReader reader = parser.getText()) {
                     System.out.println(reader.readToEnd());
                 }
             }
         } catch (InvalidPasswordException | IOException ex) {
             ; // Ignore the exception
         }
        
         class Logger implements ILogger {
             public void error(String message, Exception exception) {
                 // Print error message
                 System.out.println("Error: " + message);
             }
        
             public void trace(String message) {
                 // Print event message
                 System.out.println("Event: " + message);
             }
        
             public void warning(String message) {
                 // Print warning message
                 System.out.println("Warning: " + message);
             }
         }
         
        Parameters:
        document - The source input stream.
        loadOptions - The options to open the file.
        parserSettings - The parser settings which are used to customize data extraction.
    • Method Detail

      • getFeatures

        public Features getFeatures()
        Gets the supported features.

        If a feature isn't supported, the corresponding extraction method returns null instead of a value. Some extraction operations can take significant time, so calling an extraction method just to check whether the feature is supported is inefficient. Use getFeatures() for this purpose:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleZip)) {
             // Check if text extraction is supported for the document
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported");
                 return;
             }
             // Extract a text from the document
             try (TextReader reader = parser.getText()) {
                 System.out.println(reader.readToEnd());
             }
         }
         
        Returns:
        An instance of Features class that represents the supported features.
      • generatePreview

        public void generatePreview(PreviewOptions previewOptions)
        Generates previews of the document pages.
        Parameters:
        previewOptions - The options that set the requirements and stream delegates for preview generation.
      • getDocumentInfo

        public IDocumentInfo getDocumentInfo()
        Returns the general information about the document.

        The following example shows how to get document info:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Get the document info
             IDocumentInfo info = parser.getDocumentInfo();
             // Print document information
             System.out.println(String.format("FileType: %s", info.getFileType()));
             System.out.println(String.format("PageCount: %d", info.getPageCount()));
             System.out.println(String.format("Size: %d", info.getSize()));
         }
         
        Returns:
        An instance of a class that implements the IDocumentInfo interface.
      • getText

        public TextReader getText()
        Extracts a text from the document.

        The following example shows how to extract a text from a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a text into the reader
             try (TextReader reader = parser.getText()) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Returns:
        An instance of TextReader class with the extracted text; null if text extraction isn't supported.
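        The examples place a possibly-null reader directly in a try-with-resources header and test it for null inside the block. This is valid Java: when a resource is initialized to null, close() is simply not called for it. The sketch below demonstrates this with java.io.Reader as a stand-in for TextReader; the open method is hypothetical and only imitates an extraction call that returns null when the feature isn't supported.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class NullResourceSketch {
    // Hypothetical stand-in for an extraction call that returns null when unsupported.
    static Reader open(boolean supported) {
        return supported ? new StringReader("sample text") : null;
    }

    static String readOrFallback(boolean supported) {
        // A null resource is allowed here: no close() call is made for it.
        try (Reader reader = open(supported)) {
            if (reader == null) {
                return "Text extraction isn't supported";
            }
            StringBuilder text = new StringBuilder();
            int ch;
            while ((ch = reader.read()) != -1) {
                text.append((char) ch);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOrFallback(true));
        System.out.println(readOrFallback(false));
    }
}
```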
      • getText

        public TextReader getText(TextOptions options)
        Extracts a text from the document using text options (for example, to enable the fast raw text extraction mode).

        The following example shows how to extract a raw text from a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a raw text into the reader
             try (TextReader reader = parser.getText(new TextOptions(true))) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        options - The text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if text extraction isn't supported.
      • getText

        public TextReader getText(int pageIndex)
        Extracts a text from the document page.

        The following example shows how to extract a text from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Check if the document supports text extraction
             if (!parser.getFeatures().isText()) {
                 System.out.println("Document doesn't support text extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText(p)) {
                     // Print a text from the document
                     // We ignore null-checking as we have checked text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        An instance of TextReader class with the extracted text; null if text page extraction isn't supported.
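        Since pageIndex is zero-based while page labels shown to users usually start at 1, the loop in the example prints p + 1. A minimal sketch of that conversion follows; the helper name and the range check are assumptions of this sketch, and this reference does not state how the library itself reacts to an out-of-range index.

```java
class PageLabelSketch {
    // Converts a zero-based page index into a one-based label such as "Page 1/3".
    static String label(int pageIndex, int pageCount) {
        if (pageIndex < 0 || pageIndex >= pageCount) {
            throw new IndexOutOfBoundsException(
                    "pageIndex " + pageIndex + " is outside 0.." + (pageCount - 1));
        }
        return String.format("Page %d/%d", pageIndex + 1, pageCount);
    }

    public static void main(String[] args) {
        int pageCount = 3;
        for (int p = 0; p < pageCount; p++) {
            System.out.println(label(p, pageCount));
        }
    }
}
```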
      • getText

        public TextReader getText(int pageIndex,
                         TextOptions options)
        Extracts a text from the document page using text options (to enable raw fast text extraction mode).

        The following example shows how to extract a raw text from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Check if the document supports text extraction
             if (!parser.getFeatures().isText()) {
                 System.out.println("Document doesn't support text extraction.");
                 return;
             }
             // Get the document info
             DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
                     ? (DocumentInfo) parser.getDocumentInfo()
                     : null;
             // Check if the document has pages
             if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText(p, new TextOptions(true))) {
                     // Print a text from the document
                     // We ignore null-checking as we have checked text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if text page extraction isn't supported.
      • getFormattedText

        public TextReader getFormattedText(FormattedTextOptions options)
        Extracts a formatted text from the document.

        The following example shows how to extract a document text as HTML text:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Extract a formatted text into the reader
             try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
                 // Print a formatted text from the document
                 // If formatted text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        options - The formatted text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if formatted text extraction isn't supported.
      • getFormattedText

        public TextReader getFormattedText(int pageIndex,
                                  FormattedTextOptions options)
        Extracts a formatted text from the document page.

        The following example shows how to extract a document page text as Markdown text:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Check if the document supports formatted text extraction
             if (!parser.getFeatures().isFormattedText()) {
                 System.out.println("Document doesn't support formatted text extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
                 // Extract a formatted text into the reader
                 try (TextReader reader = parser.getFormattedText(p, new FormattedTextOptions(FormattedTextMode.Markdown))) {
                     // Print a formatted text from the document
                     // We ignore null-checking as we have checked formatted text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The formatted text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if formatted text page extraction isn't supported.
      • search

        public Iterable<SearchResult> search(String keyword,
                                    SearchOptions options)
        Searches a keyword in the document using search options (regular expression, match case, etc.).

        The following example shows how to search with a regular expression in a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Search with a regular expression with case matching
             Iterable<SearchResult> sr = parser.search("[0-9]+", new SearchOptions(true, false, true));
             // Check if search is supported
             if (sr == null) {
                 System.out.println("Search isn't supported");
                 return;
             }
             // Iterate over search results
             for (SearchResult s : sr) {
                 // Print an index and found text:
                 System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
             }
         }
         

        The following example shows how to search a text on pages:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Search a keyword with page numbers
             Iterable<SearchResult> sr = parser.search("lorem", new SearchOptions(false, false, false, true));
             // Check if search is supported
             if (sr == null) {
                 System.out.println("Search isn't supported");
                 return;
             }
             // Iterate over search results
             for (SearchResult s : sr) {
                 // Print an index, page number and found text:
                 System.out.println(String.format("At %d (%d): %s", s.getPosition(), s.getPageIndex(), s.getText()));
             }
         }
         
        Parameters:
        keyword - The keyword to search.
        options - The search options.
        Returns:
        A collection of SearchResult objects; null if search isn't supported.
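        To make the "position plus matched text" shape of a search result concrete, the sketch below runs the same pattern, [0-9]+, over an in-memory string with java.util.regex. It is a stand-in for illustration only, not the library's search implementation; the sample text is invented, and how the library counts positions within a document may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSearchSketch {
    // Returns one "At <position>: <text>" entry per regular expression match.
    static List<String> find(String text, String regex, boolean matchCase) {
        int flags = matchCase ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(String.format("At %d: %s", matcher.start(), matcher.group()));
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String hit : find("Order 42 shipped in 2024", "[0-9]+", true)) {
            System.out.println(hit);
        }
    }
}
```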
      • getHighlight

        public HighlightItem getHighlight(int position,
                                 boolean isDirect,
                                 HighlightOptions options)
        Extracts a highlight from the document.

        The following example shows how to extract a highlight that contains 3 words:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a highlight:
             HighlightItem hl = parser.getHighlight(2, true, new HighlightOptions(10, 3));
             // Check if highlight extraction is supported
             if (hl == null) {
                 System.out.println("Highlight extraction isn't supported");
                 return;
             }
             // Print an extracted highlight
             System.out.println(String.format("At %d: %s", hl.getPosition(), hl.getText()));
         }
         
        Parameters:
        position - The start position of the highlight.
        isDirect - The value that indicates whether highlight extraction is direct: true if the highlight is extracted to the right of the position; otherwise, false.
        options - The highlight extraction options.
        Returns:
        An instance of HighlightItem class that represents the extracted highlight; null if highlight extraction isn't supported.
      • getToc

        public Iterable<TocItem> getToc()
        Extracts a table of contents from the document.

        The following example shows how to extract a table of contents from an EPUB file:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleEpub)) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             // Check if toc extraction is supported
             if (!parser.getFeatures().isToc()) {
                 System.out.println("Toc extraction isn't supported.");
                 return;
             }
             // Get table of contents
             Iterable<TocItem> toc = parser.getToc();
             // Iterate over items
             for (TocItem i : toc) {
                 // Print the Toc text
                 System.out.println(i.getText());
                 // Check if page index has a value
                 if (i.getPageIndex() == null) {
                     continue;
                 }
                 // Extract a page text
                 try (TextReader reader = parser.getText(i.getPageIndex())) {
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Returns:
        A collection of table of contents items; null if table of contents extraction isn't supported.
      • getContainer

        public Iterable<ContainerItem> getContainer()
        Extracts a container object from the document to work with formats that contain attachments, ZIP archives etc.

        The following example shows how to extract a text from ZIP entries:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleZip)) {
             // Extract attachments from the container
             Iterable<ContainerItem> attachments = parser.getContainer();
             // Check if container extraction is supported
             if (attachments == null) {
                 System.out.println("Container extraction isn't supported");
                 return;
             }
             // Iterate over zip entities
             for (ContainerItem item : attachments) {
                 // Print the file path
                 System.out.println(item.getFilePath());
                 try {
                     // Create Parser object for the zip entity content
                     try (Parser attachmentParser = item.openParser()) {
                         // Extract the zip entity text
                         try (TextReader reader = attachmentParser.getText()) {
                             System.out.println(reader == null ? "No text" : reader.readToEnd());
                         }
                     }
                 } catch (UnsupportedDocumentFormatException ex) {
                     System.out.println("Isn't supported.");
                 }
             }
         }
         
        Returns:
        A collection of container items; null if container extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas()
        Extracts text areas from the document.

        The following example shows how to extract all text areas from the whole document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Extract text areas
             Iterable<PageTextArea> areas = parser.getTextAreas();
             // Check if text areas extraction is supported
             if (areas == null) {
                 System.out.println("Page text areas extraction isn't supported");
                 return;
             }
             // Iterate over page text areas
             for (PageTextArea a : areas) {
                 // Print a page index, rectangle and text area value:
                 System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
             }
         }
         
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(PageTextAreaOptions options)
        Extracts text areas from the document using customization options (regular expression, match case, etc.).

        The following example shows how to extract only the text areas in the upper-left corner of a page that match a regular expression:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Create the options which are used for text area extraction
             PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));
             // Extract text areas from the upper-left corner of a page that match the regular expression:
             Iterable<PageTextArea> areas = parser.getTextAreas(options);
             // Check if text areas extraction is supported
             if (areas == null) {
                 System.out.println("Page text areas extraction isn't supported");
                 return;
             }
             // Iterate over page text areas
             for (PageTextArea a : areas) {
                 // Print a page index, rectangle and text area value:
                 System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
             }
         }
         
        Parameters:
        options - The options for text area extraction.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
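
        The expression passed to PageTextAreaOptions in the example above can be tried in isolation with java.util.regex. The sketch below only shows what the pattern \s[a-z]{2}\s matches under standard Java regular expression rules, next to a digits-only pattern for comparison. How the library applies the expression to a text area (partial or full match) is not stated here, so the use of find() is an assumption. RegexCheck and containsMatch are hypothetical names.

```java
import java.util.regex.Pattern;

class RegexCheck {
    // The expression used in the example above: two lowercase letters
    // with a whitespace character on each side
    static final Pattern TWO_LETTERS = Pattern.compile("\\s[a-z]{2}\\s");
    // A digits-only alternative, shown for comparison
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns true if the pattern occurs anywhere in the text (partial match)
    static boolean containsMatch(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {"Invoice no 42", "Total: 1500", " to "};
        for (String s : samples) {
            System.out.println(String.format("'%s' letters=%b digits=%b",
                    s, containsMatch(TWO_LETTERS, s), containsMatch(DIGITS, s)));
        }
    }
}
```

        Checking an expression this way before passing it to the options makes it easier to confirm that it selects the intended text.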
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(int pageIndex)
        Extracts text areas from the document page.

        The following example shows how to extract text areas from each page of the document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Check if the document supports text areas extraction
             if (!parser.getFeatures().isTextAreas()) {
                 System.out.println("Document doesn't support text areas extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Iterate over page text areas
                 // The null check is skipped because text areas extraction support was verified above
                 for (PageTextArea a : parser.getTextAreas(pageIndex)) {
                     // Print a rectangle and text area value:
                     System.out.println(String.format("R: %s, Text: %s", a.getRectangle(), a.getText()));
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(int pageIndex,
                                          PageTextAreaOptions options)
        Extracts text areas from the document page using customization options (regular expression, match case, etc.).

        Parameters:
        pageIndex - The zero-based page index.
        options - The options for text area extraction.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getHyperlinks

        public Iterable<PageHyperlinkArea> getHyperlinks()
        Extracts hyperlinks from the document.

        The following example shows how to extract all hyperlinks from the whole document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(filePath)) {
             // Check if the document supports hyperlink extraction
             if (!parser.getFeatures().isHyperlinks()) {
                 System.out.println("Document doesn't support hyperlink extraction.");
                 return;
             }
             // Extract hyperlinks from the document
             Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks();
             // Iterate over hyperlinks
             for (PageHyperlinkArea h : hyperlinks) {
                 // Print the hyperlink text
                 System.out.println(h.getText());
                 // Print the hyperlink URL
                 System.out.println(h.getUrl());
                 System.out.println();
             }
         }
         
        Returns:
        A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.
      • getHyperlinks

        public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex)
        Extracts hyperlinks from the document page.

        The following example shows how to extract hyperlinks from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(filePath)) {
             // Check if the document supports hyperlink extraction
             if (!parser.getFeatures().isHyperlinks()) {
                 System.out.println("Document doesn't support hyperlink extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Extract hyperlinks from the document page
                 Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex);
                 // Iterate over hyperlinks
                 for (PageHyperlinkArea h : hyperlinks) {
                     // Print the hyperlink text
                     System.out.println(h.getText());
                     // Print the hyperlink URL
                     System.out.println(h.getUrl());
                     System.out.println();
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.
      • getHyperlinks

        public Iterable<PageHyperlinkArea> getHyperlinks(PageAreaOptions options)
        Extracts hyperlinks from the document using customization options (to set the rectangular area that contains hyperlinks).

        The following example shows how to extract hyperlinks from the document page area:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.HyperlinksPdf)) {
             // Check if the document supports hyperlink extraction
             if (!parser.getFeatures().isHyperlinks()) {
                 System.out.println("Document doesn't support hyperlink extraction.");
                 return;
             }
             // Create the options which are used for hyperlink extraction
             PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
             // Extract hyperlinks from the document page area
             Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(options);
             // Iterate over hyperlinks
             for (PageHyperlinkArea h : hyperlinks) {
                 // Print the hyperlink text
                 System.out.println(h.getText());
                 // Print the hyperlink URL
                 System.out.println(h.getUrl());
                 System.out.println();
             }
         }
         
        Parameters:
        options - The options for hyperlinks extraction.
        Returns:
        A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.
      • getHyperlinks

        public Iterable<PageHyperlinkArea> getHyperlinks(int pageIndex,
                                                PageAreaOptions options)
        Extracts hyperlinks from the document page using customization options (to set the rectangular area that contains hyperlinks).

        The following example shows how to extract hyperlinks from the document page area using customization options:

        // Create an instance of Parser class
         try (Parser parser = new Parser(filePath)) {
             // Check if the document supports hyperlink extraction
             if (!parser.getFeatures().isHyperlinks()) {
                 System.out.println("Document doesn't support hyperlink extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Create the options which are used for hyperlink extraction
             PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(380, 90), new Size(150, 50)));
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Extract hyperlinks from the document page
                 Iterable<PageHyperlinkArea> hyperlinks = parser.getHyperlinks(pageIndex, options);
                 // Iterate over hyperlinks
                 for (PageHyperlinkArea h : hyperlinks) {
                     // Print the hyperlink text
                     System.out.println(h.getText());
                     // Print the hyperlink URL
                     System.out.println(h.getUrl());
                     System.out.println();
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The options for hyperlinks extraction.
        Returns:
        A collection of PageHyperlinkArea objects; null if hyperlinks extraction isn't supported.
      • getTables

        public Iterable<PageTableArea> getTables(PageTableAreaOptions options)
        Extracts tables from the document.

        The following example shows how to extract tables from the whole document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(filePath)) {
             // Check if the document supports table extraction
             if (!parser.getFeatures().isTables()) {
                 System.out.println("Document doesn't support tables extraction.");
                 return;
             }
             // Create the layout of tables
             TemplateTableLayout layout = new TemplateTableLayout(
                     java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
                     java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
             // Create the options for table extraction
             PageTableAreaOptions options = new PageTableAreaOptions(layout);
             // Extract tables from the document
             Iterable<PageTableArea> tables = parser.getTables(options);
             // Iterate over tables
             for (PageTableArea t : tables) {
                 // Iterate over rows
                 for (int row = 0; row < t.getRowCount(); row++) {
                     // Iterate over columns
                     for (int column = 0; column < t.getColumnCount(); column++) {
                         // Get the table cell
                         PageTableAreaCell cell = t.getCell(row, column);
                         if (cell != null) {
                             // Print the table cell text
                             System.out.print(cell.getText());
                             System.out.print(" | ");
                         }
                     }
                     System.out.println();
                 }
                 System.out.println();
             }
         }
         
        Parameters:
        options - The options for tables extraction.
        Returns:
        A collection of PageTableArea objects; null if tables extraction isn't supported.
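
        The layout in the example above is built from two lists of coordinates. Assuming the first list holds the x-coordinates of the vertical separators and the second the y-coordinates of the horizontal separators (an assumption, since the constructor's parameter names are not shown here), n separators bound n - 1 columns or rows. The sketch below uses only the JDK to list the cell bounds such a grid implies. GridSketch, cellCount and cellBounds are hypothetical names and not part of the library.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class GridSketch {
    // Number of cells bounded by a list of separator coordinates
    static int cellCount(List<Double> separators) {
        return Math.max(0, separators.size() - 1);
    }

    // Describes the bounds of the cell at (row, column).
    // Assumption: 'vertical' holds x-coordinates, 'horizontal' holds y-coordinates.
    static String cellBounds(List<Double> vertical, List<Double> horizontal, int row, int column) {
        return String.format(Locale.ROOT, "x %.1f..%.1f, y %.1f..%.1f",
                vertical.get(column), vertical.get(column + 1),
                horizontal.get(row), horizontal.get(row + 1));
    }

    public static void main(String[] args) {
        // The same coordinates as in the example above
        List<Double> vertical = Arrays.asList(50.0, 95.0, 275.0, 415.0, 485.0, 545.0);
        List<Double> horizontal = Arrays.asList(325.0, 340.0, 365.0, 395.0);

        System.out.println(cellCount(horizontal) + " rows x " + cellCount(vertical) + " columns");
        for (int row = 0; row < cellCount(horizontal); row++) {
            for (int column = 0; column < cellCount(vertical); column++) {
                System.out.println("[" + row + "," + column + "] "
                        + cellBounds(vertical, horizontal, row, column));
            }
        }
    }
}
```

        Under that reading, the six and four separators used above describe a grid of three rows and five columns.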
      • getTables

        public Iterable<PageTableArea> getTables(int pageIndex,
                                        PageTableAreaOptions options)
        Extracts tables from the document page.

        The following example shows how to extract tables from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleInvoicePagesPdf)) {
             // Check if the document supports table extraction
             if (!parser.getFeatures().isTables()) {
                 System.out.println("Document doesn't support tables extraction.");
                 return;
             }
             // Create the layout of tables
             TemplateTableLayout layout = new TemplateTableLayout(
                     java.util.Arrays.asList(new Double[]{50.0, 95.0, 275.0, 415.0, 485.0, 545.0}),
                     java.util.Arrays.asList(new Double[]{325.0, 340.0, 365.0, 395.0}));
             // Create the options for table extraction
             PageTableAreaOptions options = new PageTableAreaOptions(layout);
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Extract tables from the document page
                 Iterable<PageTableArea> tables = parser.getTables(pageIndex, options);
                 // Iterate over tables
                 for (PageTableArea t : tables) {
                     // Iterate over rows
                     for (int row = 0; row < t.getRowCount(); row++) {
                         // Iterate over columns
                         for (int column = 0; column < t.getColumnCount(); column++) {
                             // Get the table cell
                             PageTableAreaCell cell = t.getCell(row, column);
                             if (cell != null) {
                                 // Print the table cell text
                                 System.out.print(cell.getText());
                                 System.out.print(" | ");
                             }
                         }
                         System.out.println();
                     }
                     System.out.println();
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The options for tables extraction.
        Returns:
        A collection of PageTableArea objects; null if tables extraction isn't supported.
      • parseForm

        public DocumentData parseForm()
        Parses the document form.

        The following example shows how to parse a form of the document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleFormsPdf)) {
             // Extract data from PDF document
             DocumentData data = parser.parseForm();
             // Check if form extraction is supported
             if (data == null) {
                 System.out.println("Form extraction isn't supported.");
                 return;
             }
             // Iterate over extracted data
             for (int i = 0; i < data.getCount(); i++) {
                 System.out.print(data.get(i).getName() + ": ");
                 PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
                         ? (PageTextArea) data.get(i).getPageArea()
                         : null;
                 System.out.println(area == null ? "Not a template field" : area.getText());
             }
         }
         
        Returns:
        An instance of DocumentData class that contains the extracted data; null if form extraction isn't supported.
      • getStructure

        public Document getStructure()
        Extracts a structured text from the document.

        Returns:
        An instance of org.w3c.dom.Document class with XML text structure; null if text structure extraction isn't supported.
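
        Because getStructure returns a standard org.w3c.dom.Document, the result can be processed with the JDK's XML APIs. The sketch below walks a DOM tree and collects its text nodes. To stay runnable without a document file it parses a small hand-written XML string; the element names (document, section, p) are placeholders and are not claimed to match the structure the library produces. StructureWalk and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StructureWalk {
    // Recursively collects the trimmed, non-empty text nodes under the given node
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Returns the text of the document in document order; a null document
    // (structure extraction not supported) yields an empty list
    static List<String> textOf(Document document) {
        List<String> out = new ArrayList<>();
        if (document != null) {
            collectText(document.getDocumentElement(), out);
        }
        return out;
    }

    // Parses an XML string; stands in for the Document returned by parser.getStructure()
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder XML; the real element names may differ
        String xml = "<document><section><p>First paragraph</p><p>Second paragraph</p></section></document>";
        for (String line : textOf(parse(xml))) {
            System.out.println(line);
        }
    }
}
```

        In real use, the Document passed to textOf would come from parser.getStructure(); the null check covers the unsupported case described above.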
      • close

        public void close()
        Closes this resource, relinquishing any underlying resources.
        Specified by:
        close in interface Closeable
        Specified by:
        close in interface AutoCloseable
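
        Since Parser implements Closeable, the examples on this page create it in a try-with-resources statement, which calls close() automatically. The sketch below illustrates the order in which nested try-with-resources blocks release their resources, using a hypothetical Resource class in place of the parser and a reader; it does not use the library itself.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

class CloseOrder {
    // Hypothetical stand-in for a closeable object such as a parser or a reader
    static final class Resource implements Closeable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens two nested resources and records the order of events
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Resource parser = new Resource("parser", log)) {
            try (Resource reader = new Resource("reader", log)) {
                log.add("work");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : run()) {
            System.out.println(event);
        }
    }
}
```

        The inner resource is closed before the outer one, so a reader obtained from a parser is released before the parser itself.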