  1. What are the image file types supported?

    1. The following types of files are supported for extracting text:

      1. png

      2. jpg

      3. jpeg

      4. pdf 
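
As a quick illustration of the list above, here is a minimal sketch that checks whether an attachment's file name has one of the supported extensions. The function name is hypothetical, and matching extensions case-insensitively is an assumption, not documented ExtracText behavior.

```python
from pathlib import Path

# Extensions listed in this FAQ as supported for text extraction.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}


def is_supported_attachment(filename: str) -> bool:
    """Return True if the file's last extension is in the supported list.

    Assumption: the comparison ignores case, so "SCAN.PDF" counts as a PDF.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported_attachment("screenshot.png")` is true, while a `.gif` file is rejected.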

  2. What are the languages supported?

    1. The following languages are currently supported:

      1. English
      2. Danish
      3. Dutch
      4. French
      5. German
      6. Italian
      7. Polish
      8. Portuguese
      9. Russian
      10. Spanish
      11. Swedish

  3. When are the image files indexed for searching?

    1. When a page is added or edited with new attachments, it is added to the Confluence index queue. When the page is indexed, all of its attachments (images) are indexed as well. In addition, clicking the "ExtracText" button on any image triggers a page reindex, which reindexes all the attachments/images on that page.

  4. How to index the old image files that were added prior to installing ExtracText?

    1. If you run a regular Confluence reindex, all the attachments (images) that have not been indexed so far will be indexed. If you only need to index one page, either edit that page and save it, or click the "ExtracText" button on any image on that page.

  5. How to force rerun text extraction for an image?

    1. By default, text extraction happens only once for a given attachment, and you cannot force a rerun. Once the text is extracted, it is cached, and this cache cannot be deleted. However, if you delete the existing attachment and add the same image as a new attachment, it will be indexed again.

  6. How to view the whole text extracted from an image?

    1. The whole text extracted from the image is added to the search index, and the attachments can be found with a regular Confluence search. To view the whole text extracted from an image, hover over that image while viewing it on any page and click the "ExtracText" button.

  7. How to search for text extracted from images?

    1. The text is added to the Confluence index, so you can find attachments with a regular Confluence search. You can combine the search with other CQL syntax, for instance to return only attachments (type=attachment).
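
A search like this can also be run programmatically. The sketch below builds a URL for Confluence's content search REST resource (`/rest/api/content/search`) with a CQL query restricted to attachments. The base URL and function name are placeholders, and the exact path prefix may differ between Confluence editions, so treat this as an illustration rather than a tested integration.

```python
from urllib.parse import urlencode


def build_attachment_search_url(base_url: str, term: str, limit: int = 25) -> str:
    """Build a content-search URL whose CQL matches attachments containing `term`."""
    # Escape backslashes and double quotes so the term stays inside the CQL string.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    cql = f'type = attachment AND text ~ "{escaped}"'
    # urlencode percent-encodes the CQL so it is safe as a query parameter.
    query = urlencode({"cql": cql, "limit": limit})
    return f"{base_url.rstrip('/')}/rest/api/content/search?{query}"


# Hypothetical base URL, used only for illustration.
url = build_attachment_search_url("https://confluence.example.com", "Chrome browser")
```

The resulting URL can be requested with any HTTP client, using whatever authentication your instance requires.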

  8. Why can't I find the attachment even though I use the correct search string?

    1. To verify that an image is indexed, go to the page that contains the image, hover over the image, and click "ExtracText". Verify that the text you are looking for was identified correctly. Clicking the "ExtracText" button also ensures that the page is reindexed with all its attachments, so after allowing a few seconds for the reindex to finish, search for the text again and it should be found. If your Confluence administrator runs a reindex of the whole content after ExtracText is installed, you should not have this issue. It should also not happen for attachments added after ExtracText is installed, because they are indexed when the page is saved. Alternatively, try a fuzzy search in case some characters were not identified correctly by the OCR engine. For example, to search for a phrase like "Chrome browser" you could use a CQL query like ExtracText ~ "Chrome~ browser~". This returns results even if the extracted text reads "Ohrome drowser".
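
To see why a fuzzy search can still match misread text, it helps to count how many single-character edits separate the OCR output from the intended word. The sketch below computes the classic Levenshtein edit distance. It assumes that the fuzzy operator tolerates a small number of such edits, as Lucene-style fuzzy matching typically does. It is not ExtracText's or Confluence's actual matching code.

```python
def edit_distance(a: str, b: str) -> int:
    """Return the Levenshtein distance between a and b.

    The distance is the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b.
    """
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("chrome", "ohrome"))    # 1
print(edit_distance("browser", "drowser"))  # 1
```

Both misread words are one substitution away from the intended ones, which is the kind of small difference a fuzzy term is meant to absorb.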

  9. When I install ExtracText or restart Confluence, the ExtracText app is disabled. How do I enable it?

    1. ExtracText will not be enabled if the runtime free memory is less than 300MB. In that case you will see an error message in the log that says "Free memory is xxxMB which is less than required memory (300MB) for ExtracText. Hence not proceeding with the install". This can happen on instances where Confluence runs with very little memory (say around 1GB). You can try enabling the app again after some time, in case memory has been freed up, or increase the memory allotted to Confluence as per the guidelines.
    2. If you are installing ExtracText on Windows, make sure the Visual C++ 2015 redistributable is installed before you install ExtracText. If it is not, uninstall ExtracText, install the redistributable, and then install ExtracText again.
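
To check whether the low-memory condition is the cause, you can scan the Confluence log for the message quoted above. The following is a minimal sketch. It assumes the "xxx" placeholder is a whole number of megabytes and that the wording in your log matches the quoted message exactly. The function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Pattern based on the error message quoted in this FAQ.
# Assumption: the free-memory value is an integer followed by "MB".
LOW_MEMORY_PATTERN = re.compile(
    r"Free memory is (\d+)\s*MB which is less than required memory "
    r"\((\d+)\s*MB\) for ExtracText"
)


def find_low_memory_errors(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (free_mb, required_mb) for every matching log line."""
    hits = []
    for line in lines:
        match = LOW_MEMORY_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits
```

Pass it the lines of your log file, for example `find_low_memory_errors(open(path))`, where `path` points to wherever your instance writes its application log.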

  10. What can I do to improve the quality of the extracted text?

    1. The following tips help produce better text extraction results:
      1. Use images with a clear background. Although ExtracText can detect text on any background color, it may fail if a single block of text (a word, a sentence, or a paragraph) has different background colors.
      2. Prefer machine-generated images. ExtracText is tuned for detecting text in computer-generated images such as screenshots or scanned documents. It does not work well on natural images, so it may fail to detect text in images captured with a camera.
      3. When saving screenshots, choose PNG as the file format, since it provides lossless compression.
      4. Avoid noise in images ("noise" means randomly distributed pixels in the background).
      5. Make sure the text is neither too small (characters less than 10 pixels high) nor too big (characters more than 40 pixels wide).
      6. Make sure the text is properly segmented (e.g. if text in multiple windows overlaps in the screenshot, results may not be accurate).

  11. Where can I download the jar file for installation?

    1. Visit https://marketplace.atlassian.com/apps/1219718 and click the Download link.

  12. I am unable to install ExtracText on a 32-bit Linux system. How do I fix it?

    1. If you need support installing ExtracText on a 32-bit Linux system, please contact us.

  13. Why is the text from images in my PDF document not extracted?

    1. ExtracText only extracts text if the PDF document contains "ONLY" images (as in scanned documents). If the document contains text or any other content, ExtracText does not process the file and lets Confluence index the text in the PDF.

  14. The output from images in my PDF document is totally wrong. How do I fix it?

    1. If your PDF document contains images that are rotated, you may see only junk characters in the output. Verify that the image is positioned upright in the PDF. If you are scanning documents, make sure to get the orientation right. Note that ExtracText can handle some degree of skew.