com.groupdocs.parser

Class Parser

  • All Implemented Interfaces:
    Closeable, AutoCloseable


    public class Parser
    extends Object
    implements Closeable
    Represents the main class that controls extraction of text, images, and containers, as well as parsing functionality.
    • Constructor Detail

      • Parser

        public Parser(String filePath)
        Initializes a new instance of the Parser class.

        Usage:

        // Set the filePath
         String filePath = Constants.SamplePdf;
         // Create an instance of Parser class with the filePath
         try (Parser parser = new Parser(filePath)) {
             // Extract a text into the reader
             try (TextReader reader = parser.getText()) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        filePath - The path to the file.
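        The usage above nests two try-with-resources blocks, one for the Parser and one for the TextReader. As a minimal sketch of what that nesting guarantees, the following records the order in which resources are closed. Tracked is a hypothetical stand-in class, not part of GroupDocs.Parser; it only illustrates the standard Java behavior that the inner resource is closed before the outer one.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Hypothetical stand-in for any closeable resource such as Parser or TextReader
    static final class Tracked implements Closeable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            // Record the moment this resource is released
            log.add("close " + name);
        }
    }

    // Uses the same nesting as the usage example and returns the recorded events
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Tracked outer = new Tracked("parser", log)) {
            try (Tracked inner = new Tracked("reader", log)) {
                log.add("work");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints [work, close reader, close parser]
        System.out.println(run());
    }
}
```

        Because the reader is closed first, it never outlives the parser that produced it.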
      • Parser

        public Parser(String filePath,
              LoadOptions loadOptions)
        Initializes a new instance of the Parser class with LoadOptions.

        The document password is passed via the LoadOptions class:

        try {
             String password = "123456";
             // Create an instance of Parser class with the password:
             try (Parser parser = new Parser(Constants.SamplePassword, new LoadOptions(password))) {
                 // Check if text extraction is supported
                 if (!parser.getFeatures().isText()) {
                     System.out.println("Text extraction isn't supported.");
                     return;
                 }
                 // Print the document text
                 try (TextReader reader = parser.getText()) {
                     System.out.println(reader.readToEnd());
                 }
             }
         } catch (InvalidPasswordException ex) {
             // Print the message if the password is incorrect or empty
             System.out.println("Invalid password");
         }
         
        Parameters:
        filePath - The path to the file.
        loadOptions - The options to open the file.
      • Parser

        public Parser(String filePath,
              LoadOptions loadOptions,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class with LoadOptions and ParserSettings.
        Parameters:
        filePath - The path to the file.
        loadOptions - The options to open the file.
        parserSettings - The parser settings which are used to customize data extraction.
      • Parser

        public Parser(InputStream document)
        Initializes a new instance of the Parser class.

        Usage:

        // Create the stream
         try (InputStream stream = new FileInputStream(Constants.SamplePdf)) {
             // Create an instance of Parser class with the stream
             try (Parser parser = new Parser(stream)) {
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText()) {
                     // Print a text from the document
                     // If text extraction isn't supported, a reader is null
                     System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        document - The source input stream.
      • Parser

        public Parser(InputStream document,
              LoadOptions loadOptions)
        Initializes a new instance of the Parser class with LoadOptions.

        In some cases it's necessary to specify FileFormat explicitly, both for special cases (databases, email servers) and when the file type would otherwise have to be detected from the content:

        // Create an instance of Parser class for a Markdown document
         try (Parser parser = new Parser(stream, new LoadOptions(FileFormat.Markup))) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             try (TextReader reader = parser.getText()) {
                 // Print the document text
                 // Markdown is detected; text without special symbols is printed
                 System.out.println(reader.readToEnd());
             }
         }
         
        Parameters:
        document - The source input stream.
        loadOptions - The options to open the file.
      • Parser

        public Parser(InputStream document,
              LoadOptions loadOptions,
              ParserSettings parserSettings)
        Initializes a new instance of the Parser class with LoadOptions and ParserSettings.
        Parameters:
        document - The source input stream.
        loadOptions - The options to open the file.
        parserSettings - The parser settings which are used to customize data extraction.
    • Method Detail

      • getFeatures

        public Features getFeatures()
        Gets the supported features.

        If a feature isn't supported, the corresponding extraction method returns null instead of a value. Because some extraction operations can take significant time, calling an extraction method just to check whether a feature is supported is not optimal. Use getFeatures() for that check instead:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleZip)) {
             // Check if text extraction is supported for the document
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported");
                 return;
             }
             // Extract a text from the document
             try (TextReader reader = parser.getText()) {
                 System.out.println(reader.readToEnd());
             }
         }
         
        Returns:
        An instance of Features class that represents the supported features.
      • generatePreview

        public void generatePreview(PreviewOptions previewOptions)
        Generates page previews.
        Parameters:
        previewOptions - The options that set the requirements and stream delegates for preview generation.
      • getDocumentInfo

        public IDocumentInfo getDocumentInfo()
        Returns the general information about the document.

        Usage:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Get the document info
             IDocumentInfo info = parser.getDocumentInfo();
             // Print document information
             System.out.println(String.format("FileType: %s", info.getFileType()));
             System.out.println(String.format("PageCount: %d", info.getPageCount()));
             System.out.println(String.format("Size: %d", info.getSize()));
         }
         
        Returns:
        An instance of a class that implements the IDocumentInfo interface.
      • getText

        public TextReader getText()
        Extracts a text from the document.

        The following example shows how to extract a text from a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a text into the reader
             try (TextReader reader = parser.getText()) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Returns:
        An instance of TextReader class with the extracted text; null if text extraction isn't supported.
      • getText

        public TextReader getText(TextOptions options)
        Extracts a text from the document using text options (to enable raw fast text extraction mode).

        The following example shows how to extract a raw text from a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a raw text into the reader
             try (TextReader reader = parser.getText(new TextOptions(true))) {
                 // Print a text from the document
                 // If text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        options - The text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if text extraction isn't supported.
      • getText

        public TextReader getText(int pageIndex)
        Extracts a text from the document page.

        The following example shows how to extract a text from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Check if the document supports text extraction
             if (!parser.getFeatures().isText()) {
                 System.out.println("Document doesn't support text extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText(p)) {
                     // Print a text from the document
                     // We ignore null-checking as we have checked text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        An instance of TextReader class with the extracted text; null if text page extraction isn't supported.
      • getText

        public TextReader getText(int pageIndex,
                         TextOptions options)
        Extracts a text from the document page using text options (to enable raw fast text extraction mode).

        The following example shows how to extract a raw text from the document page:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Check if the document supports text extraction
             if (!parser.getFeatures().isText()) {
                 System.out.println("Document doesn't support text extraction.");
                 return;
             }
             // Get the document info
             DocumentInfo documentInfo = parser.getDocumentInfo() instanceof DocumentInfo
                     ? (DocumentInfo) parser.getDocumentInfo()
                     : null;
             // Check if the document has pages
             if (documentInfo == null || documentInfo.getRawPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getRawPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getRawPageCount()));
                 // Extract a text into the reader
                 try (TextReader reader = parser.getText(p, new TextOptions(true))) {
                     // Print a text from the document
                     // We ignore null-checking as we have checked text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if text page extraction isn't supported.
      • getFormattedText

        public TextReader getFormattedText(FormattedTextOptions options)
        Extracts a formatted text from the document.

        The following example shows how to extract a document text as HTML text:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Extract a formatted text into the reader
             try (TextReader reader = parser.getFormattedText(new FormattedTextOptions(FormattedTextMode.Html))) {
                 // Print a formatted text from the document
                 // If formatted text extraction isn't supported, a reader is null
                 System.out.println(reader == null ? "Formatted text extraction isn't supported" : reader.readToEnd());
             }
         }
         
        Parameters:
        options - The formatted text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if formatted text extraction isn't supported.
      • getFormattedText

        public TextReader getFormattedText(int pageIndex,
                                  FormattedTextOptions options)
        Extracts a formatted text from the document page.

        The following example shows how to extract a document page text as Markdown text:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Check if the document supports formatted text extraction
             if (!parser.getFeatures().isFormattedText()) {
                 System.out.println("Document doesn't support formatted text extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int p = 0; p < documentInfo.getPageCount(); p++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", p + 1, documentInfo.getPageCount()));
                 // Extract a formatted text into the reader
                 try (TextReader reader = parser.getFormattedText(p, new FormattedTextOptions(FormattedTextMode.Markdown))) {
                     // Print a formatted text from the document
                     // We ignore null-checking as we have checked formatted text extraction feature support earlier
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        options - The formatted text extraction options.
        Returns:
        An instance of TextReader class with the extracted text; null if formatted text page extraction isn't supported.
      • search

        public Iterable<SearchResult> search(String keyword)
        Searches a keyword in the document.

        The following example shows how to find a keyword in a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Search a keyword:
             Iterable<SearchResult> sr = parser.search("lorem");
             // Check if search is supported
             if (sr == null) {
                 System.out.println("Search isn't supported");
                 return;
             }
             // Iterate over search results
             for (SearchResult s : sr) {
                 // Print an index and found text:
                 System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
             }
         }
         
        Parameters:
        keyword - The keyword to search.
        Returns:
        A collection of SearchResult objects; null if search isn't supported.
      • search

        public Iterable<SearchResult> search(String keyword,
                                    SearchOptions options)
        Searches a keyword in the document using search options (regular expression, match case, etc.).

        The following example shows how to search with a regular expression in a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Search with a regular expression with case matching
             Iterable<SearchResult> sr = parser.search("[0-9]+", new SearchOptions(true, false, true));
             // Check if search is supported
             if (sr == null) {
                 System.out.println("Search isn't supported");
                 return;
             }
             // Iterate over search results
             for (SearchResult s : sr) {
                 // Print an index and found text:
                 System.out.println(String.format("At %d: %s", s.getPosition(), s.getText()));
             }
         }
         

        The following example shows how to search a text on pages:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Search a keyword with page numbers
             Iterable<SearchResult> sr = parser.search("lorem", new SearchOptions(false, false, false, true));
             // Check if search is supported
             if (sr == null) {
                 System.out.println("Search isn't supported");
                 return;
             }
             // Iterate over search results
             for (SearchResult s : sr) {
                 // Print an index, page number and found text:
                 System.out.println(String.format("At %d (%d): %s", s.getPosition(), s.getPageIndex(), s.getText()));
             }
         }
         
        Parameters:
        keyword - The keyword to search.
        options - The search options.
        Returns:
        A collection of SearchResult objects; null if search isn't supported.
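        To make the shape of the regular expression results above concrete, here is a minimal sketch that uses only java.util.regex on a plain string. It is not the GroupDocs.Parser implementation, and it assumes the search pattern follows java.util.regex syntax, which this page does not state. The class RegexSearchSketch and its find method are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSearchSketch {
    // Returns "At <position>: <text>" for every match of the pattern in the text
    static List<String> find(String text, String regex) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // m.start() is the zero-based index of the match in the text
            hits.add(String.format("At %d: %s", m.start(), m.group()));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints "At 8: 42" and then "At 17: 2019"
        for (String hit : find("Invoice 42 dated 2019", "[0-9]+")) {
            System.out.println(hit);
        }
    }
}
```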
      • getHighlight

        public HighlightItem getHighlight(int position,
                                 boolean isDirect,
                                 HighlightOptions options)
        Extracts a highlight from the document.

        The following example shows how to extract a highlight that contains 3 words:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SamplePdf)) {
             // Extract a highlight:
             HighlightItem hl = parser.getHighlight(2, true, new HighlightOptions(10, 3));
             // Check if highlight extraction is supported
             if (hl == null) {
                 System.out.println("Highlight extraction isn't supported");
                 return;
             }
             // Print an extracted highlight
             System.out.println(String.format("At %d: %s", hl.getPosition(), hl.getText()));
         }
         
        Parameters:
        position - The start position of the highlight.
        isDirect - The value that indicates whether highlight extraction is direct: true if the highlight is extracted to the right of the position; otherwise, false.
        options - The highlight extraction options.
        Returns:
        An instance of HighlightItem class that represents the extracted highlight; null if highlight extraction isn't supported.
      • getToc

        public Iterable<TocItem> getToc()
        Extracts a table of contents from the document.

        The following example shows how to extract a table of contents from an EPUB file:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleEpub)) {
             // Check if text extraction is supported
             if (!parser.getFeatures().isText()) {
                 System.out.println("Text extraction isn't supported.");
                 return;
             }
             // Check if toc extraction is supported
             if (!parser.getFeatures().isToc()) {
                 System.out.println("Toc extraction isn't supported.");
                 return;
             }
             // Get table of contents
             Iterable<TocItem> toc = parser.getToc();
             // Iterate over items
             for (TocItem i : toc) {
                 // Print the Toc text
                 System.out.println(i.getText());
                 // Check if page index has a value
                 if (i.getPageIndex() == null) {
                     continue;
                 }
                 // Extract a page text
                 try (TextReader reader = parser.getText(i.getPageIndex())) {
                     System.out.println(reader.readToEnd());
                 }
             }
         }
         
        Returns:
        A collection of table of contents items; null if table of contents extraction isn't supported.
      • getMetadata

        public Iterable<MetadataItem> getMetadata()
        Extracts metadata from the document.

        The following example shows how to extract metadata from a document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleDocx)) {
             // Extract metadata from the document
             Iterable<MetadataItem> metadata = parser.getMetadata();
             // Check if metadata extraction is supported
             if (metadata == null) {
                 System.out.println("Metadata extraction isn't supported");
                 return;
             }
             // Iterate over metadata items
             for (MetadataItem item : metadata) {
                 // Print an item name and value
                 System.out.println(String.format("%s: %s", item.getName(), item.getValue()));
             }
         }
         
        Returns:
        A collection of metadata items; null if metadata extraction isn't supported.
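        Several methods on this page return null when a feature isn't supported, and a for-each loop over a null Iterable throws NullPointerException. As one possible way to guard against that, the sketch below substitutes an empty sequence for null. NullSafeIterable, orEmpty, and count are hypothetical helpers written for this illustration; they are not part of GroupDocs.Parser.

```java
import java.util.Arrays;
import java.util.Collections;

class NullSafeIterable {
    // Replaces a null result (feature not supported) with an empty sequence
    static <T> Iterable<T> orEmpty(Iterable<T> items) {
        return items == null ? Collections.<T>emptyList() : items;
    }

    // Counts the items; a null input counts as zero
    static <T> int count(Iterable<T> items) {
        int n = 0;
        for (T ignored : orEmpty(items)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Prints 0: the loop body never runs and nothing is thrown
        System.out.println(count(null));
        // Prints 2
        System.out.println(count(Arrays.asList("Author", "Title")));
    }
}
```

        The trade-off is that an unsupported feature and an empty result become indistinguishable, so an explicit null check remains the better choice when the caller needs to report the difference.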
      • getContainer

        public Iterable<ContainerItem> getContainer()
        Extracts a container object from the document to work with formats that contain attachments, ZIP archives etc.

        The following example shows how to extract a text from zip entities:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleZip)) {
             // Extract attachments from the container
             Iterable<ContainerItem> attachments = parser.getContainer();
             // Check if container extraction is supported
             if (attachments == null) {
                 System.out.println("Container extraction isn't supported");
                 return;
             }
             // Iterate over zip entities
             for (ContainerItem item : attachments) {
                 // Print the file path
                 System.out.println(item.getFilePath());
                 try {
                     // Create Parser object for the zip entity content
                     try (Parser attachmentParser = item.openParser()) {
                         // Extract a zip entity text
                         try (TextReader reader = attachmentParser.getText()) {
                             System.out.println(reader == null ? "No text" : reader.readToEnd());
                         }
                     }
                 } catch (UnsupportedDocumentFormatException ex) {
                     System.out.println("Isn't supported.");
                 }
             }
         }
         
        Returns:
        A collection of container items; null if container extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas()
        Extracts text areas from the document.

        The following example shows how to extract all text areas from the whole document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Extract text areas
             Iterable<PageTextArea> areas = parser.getTextAreas();
             // Check if text areas extraction is supported
             if (areas == null) {
                 System.out.println("Page text areas extraction isn't supported");
                 return;
             }
             // Iterate over page text areas
             for (PageTextArea a : areas) {
                 // Print a page index, rectangle and text area value:
                 System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
             }
         }
         
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(PageTextAreaOptions options)
        Extracts text areas from the document using customization options (regular expression, match case, etc.).

        The following example shows how to extract only text areas that match a regular expression from the upper-left corner:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Create the options which are used for text area extraction
             PageTextAreaOptions options = new PageTextAreaOptions("\\s[a-z]{2}\\s", new Rectangle(new Point(0, 0), new Size(300, 100)));
             // Extract text areas that match the regular expression from the upper-left corner of a page:
             Iterable<PageTextArea> areas = parser.getTextAreas(options);
             // Check if text areas extraction is supported
             if (areas == null) {
                 System.out.println("Page text areas extraction isn't supported");
                 return;
             }
             // Iterate over page text areas
             for (PageTextArea a : areas) {
                 // Print a page index, rectangle and text area value:
                 System.out.println(String.format("Page: %d, R: %s, Text: %s", a.getPage().getIndex(), a.getRectangle(), a.getText()));
             }
         }
         
        Parameters:
        options - The options for text area extraction.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(int pageIndex)
        Extracts text areas from the document page.

        To extract text areas from a document page the following method is used:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Check if the document supports text areas extraction
             if (!parser.getFeatures().isTextAreas()) {
                 System.out.println("Document doesn't support text areas extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Iterate over page text areas
                 // We ignore null-checking as we have checked text areas extraction feature support earlier
                 for (PageTextArea a : parser.getTextAreas(pageIndex)) {
                     // Print a rectangle and text area value:
                     System.out.println(String.format("R: %s, Text: %s", a.getRectangle(), a.getText()));
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getTextAreas

        public Iterable<PageTextArea> getTextAreas(int pageIndex,
                                          PageTextAreaOptions options)
        Extracts text areas from the document page using customization options (regular expression, match case, etc.).
        Parameters:
        pageIndex - The zero-based page index.
        options - The options for text area extraction.
        Returns:
        A collection of PageTextArea objects; null if text areas extraction isn't supported.
      • getImages

        public Iterable<PageImageArea> getImages()
        Extracts images from the document.

        The following example shows how to extract all images from the whole document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Extract images
             Iterable<PageImageArea> images = parser.getImages();
             // Check if images extraction is supported
             if (images == null) {
                 System.out.println("Images extraction isn't supported");
                 return;
             }
             // Iterate over images
             for (PageImageArea image : images) {
                 // Print a page index, rectangle and image type:
                 System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
             }
         }
         
        Returns:
        A collection of PageImageArea objects; null if images extraction isn't supported.
      • getImages

        public Iterable<PageImageArea> getImages(PageAreaOptions options)
        Extracts images from the document using customization options (to set the rectangular area that contains images).

        The following example shows how to extract only images from the upper-left corner:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Create the options which are used for images extraction
             PageAreaOptions options = new PageAreaOptions(new Rectangle(new Point(340, 150), new Size(300, 100)));
             // Extract images from the upper-left corner of a page:
             Iterable<PageImageArea> images = parser.getImages(options);
             // Check if images extraction is supported
             if (images == null) {
                 System.out.println("Page images extraction isn't supported");
                 return;
             }
             // Iterate over images
             for (PageImageArea image : images) {
                 // Print a page index, rectangle and image type:
                 System.out.println(String.format("Page: %d, R: %s, Type: %s", image.getPage().getIndex(), image.getRectangle(), image.getFileType()));
             }
         }
         
        Parameters:
        options - The options for images extraction.
        Returns:
        A collection of PageImageArea objects; null if images extraction isn't supported.
      • getImages

        public Iterable<PageImageArea> getImages(int pageIndex)
        Extracts images from the document page.

        To extract images from a document page the following method is used:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleImagesPdf)) {
             // Check if the document supports images extraction
             if (!parser.getFeatures().isImages()) {
                 System.out.println("Document doesn't support images extraction.");
                 return;
             }
             // Get the document info
             IDocumentInfo documentInfo = parser.getDocumentInfo();
             // Check if the document has pages
             if (documentInfo.getPageCount() == 0) {
                 System.out.println("Document has no pages.");
                 return;
             }
             // Iterate over pages
             for (int pageIndex = 0; pageIndex < documentInfo.getPageCount(); pageIndex++) {
                 // Print a page number
                 System.out.println(String.format("Page %d/%d", pageIndex + 1, documentInfo.getPageCount()));
                 // Iterate over images
                 // We ignore null-checking as we have checked images extraction feature support earlier
                 for (PageImageArea image : parser.getImages(pageIndex)) {
                     // Print a rectangle and image type
                     System.out.println(String.format("R: %s, Type: %s", image.getRectangle(), image.getFileType()));
                 }
             }
         }
         
        Parameters:
        pageIndex - The zero-based page index.
        Returns:
        A collection of PageImageArea objects; null if images extraction isn't supported.
      • getImages

        public Iterable<PageImageArea> getImages(int pageIndex,
                                        PageAreaOptions options)
        Extracts images from the document page using customization options (to set the rectangular area that contains images).
        Parameters:
        pageIndex - The zero-based page index.
        options - The options for images extraction.
        Returns:
        A collection of PageImageArea objects; null if images extraction isn't supported.
      • parseByTemplate

        public DocumentData parseByTemplate(Template template)
        Parses the document by the user-generated template.
        Parameters:
        template - The user-generated template.
        Returns:
        An instance of DocumentData class that contains the extracted data; null if parsing by template isn't supported.
      • parseForm

        public DocumentData parseForm()
        Parses the document form.

        The following example shows how to parse a form of the document:

        // Create an instance of Parser class
         try (Parser parser = new Parser(Constants.SampleFormsPdf)) {
             // Extract data from PDF document
             DocumentData data = parser.parseForm();
             // Check if form extraction is supported
             if (data == null) {
                 System.out.println("Form extraction isn't supported.");
                 return;
             }
             // Iterate over extracted data
             for (int i = 0; i < data.getCount(); i++) {
                 System.out.print(data.get(i).getName() + ": ");
                 PageTextArea area = data.get(i).getPageArea() instanceof PageTextArea
                         ? (PageTextArea) data.get(i).getPageArea()
                         : null;
                 System.out.println(area == null ? "Not a template field" : area.getText());
             }
         }
         
        Returns:
        An instance of DocumentData class that contains the extracted data; null if form parsing isn't supported.
      • getStructure

        public Document getStructure()
        Extracts a structured text from the document.
        Returns:
        An instance of org.w3c.dom.Document class with XML text structure; null if text structure extraction isn't supported.
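        Since the result is a standard org.w3c.dom.Document, it can be traversed with the regular DOM API. The following is a minimal sketch of collecting text nodes in document order. The XML string is a hypothetical stand-in for the value returned by getStructure(); the real element names are not described on this page and may differ. StructureWalkSketch, collectText, and textOf are illustrative names, not library API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class StructureWalkSketch {
    // Appends the trimmed content of every non-blank text node, in document order
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String value = node.getNodeValue().trim();
            if (!value.isEmpty()) {
                out.add(value);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses an XML string (stand-in for getStructure()) and returns its text nodes
    static List<String> textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            collectText(doc, out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Invalid XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical structure; the actual element names may differ
        String xml = "<document><section><p>First</p><p>Second</p></section></document>";
        // Prints [First, Second]
        System.out.println(textOf(xml));
    }
}
```

        With a real Parser instance, the Document returned by getStructure() would be passed to collectText directly, after checking it for null.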