Category Archives: apache-tika

Apache Tika REST server // Code 422 (Unprocessable Entity) for different states -> How to distinguish?

The Apache Tika REST server returns status code 422 (Unprocessable Entity) for a password-protected PDF document. If the file format is unsupported, 422 is sent as well.

Unfortunately, it is not possible to distinguish whether the metadata of a file could not be determined because of encryption or because of an unsupported format.

When I run the file through the Tika app, I get either the message "encrypted file" or "format not valid" in the console.

Unfortunately, the response headers also contain no additional information.

Example:

HTTP/1.1 422 Unprocessable Entity
Date: Fri, 11 May 2018 12:21:28 GMT
Content-Length: 0
Server: Jetty (8.y.z-SNAPSHOT)

Is there a way to get an additional description of the 422 error from a REST call, preferably via additional response headers?
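
For reference, the distinction the Tika app prints to the console can be reproduced client-side with tika-core, because an encrypted document surfaces as an EncryptedDocumentException while an unparseable format surfaces as a plain TikaException. A minimal sketch, assuming a local file named document.pdf (this is a client-side workaround, not a tika-server feature):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class Distinguish422 {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(Paths.get("document.pdf"))) {
            parser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
            System.out.println("parsed OK");
        } catch (EncryptedDocumentException e) {
            // the "encrypted file" case that the server folds into 422
            System.out.println("encrypted file");
        } catch (TikaException e) {
            // the "format not valid" case, also folded into 422
            System.out.println("format not valid");
        }
    }
}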

Many thanks, greetings Oliver

Apache Tika converts embedded WordPad file in a docx file to .bin file

I am trying to extract all the embedded files in a Word file (docx) and put the embedded files in a separate folder. I followed the example given by the Apache community here: https://svn.apache.org/repos/asf/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java

Though this is able to parse most of the embedded objects correctly, it converts the embedded WordPad files to OleObject.bin. I want to access the WordPad files in the same format in which they were embedded in the document.

I am new to Apache Tika and I was not able to find a solution for this through a normal Google search. There was a mention of a fix related to my problem in v1.3 of Tika, but I am using 1.18, so I think it is fixed and I might be missing something in my implementation. Please help me with this issue.
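
One approach worth sketching, not a confirmed fix: the linked example writes each embedded stream out under the name the container reports, and OOXML containers report generic names like oleObject1.bin for OLE-wrapped objects. A custom EmbeddedDocumentExtractor can instead run Tika's detector on the embedded stream and derive the file extension from the detected type. The class name and output directory below are made up for illustration; note that a WordPad payload inside a raw OLE wrapper may still only be detected as a generic msoffice type, in which case the wrapper has to be unpacked first:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypeException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TypeAwareExtractor implements EmbeddedDocumentExtractor {
    private final TikaConfig config = TikaConfig.getDefaultConfig();
    private final Path outputDir = Paths.get("extracted");
    private int count = 0;

    public boolean shouldParseEmbedded(Metadata metadata) {
        return true; // extract every embedded object
    }

    public void parseEmbedded(InputStream stream, ContentHandler handler,
            Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // detect the real type instead of trusting the reported .bin name
        TikaInputStream tis = TikaInputStream.get(stream);
        MediaType type = config.getDetector().detect(tis, metadata);
        String extension = ".bin"; // fallback when the type stays unknown
        try {
            MimeType mimeType = config.getMimeRepository().forName(type.toString());
            if (!mimeType.getExtension().isEmpty()) {
                extension = mimeType.getExtension();
            }
        } catch (MimeTypeException e) {
            // keep the .bin fallback
        }
        Files.createDirectories(outputDir);
        Files.copy(tis, outputDir.resolve("embedded-" + (count++) + extension));
    }
}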

Extract text from a PDF file using Apache Tika in Java

try {
    File file = new File("Example.pdf");
    String content = new Tika().parseToString(file);
    System.out.println("The Content: " + content);
} catch (Exception e) {
    e.printStackTrace();
}

I have imported java.io.File and org.apache.tika.Tika, but while running this code I get an error like this:

Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;Ljava/lang/Throwable;)V
    at org.apache.commons.logging.impl.SLF4JLocationAwareLog.warn(SLF4JLocationAwareLog.java:162)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.loadDiskCache(FileSystemFontProvider.java:461)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.<init>(FileSystemFontProvider.java:217)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl$DefaultFontProvider.<init>(FontMapperImpl.java:130)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getProvider(FontMapperImpl.java:149)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:413)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:376)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:350)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:146)
    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:79)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
    at org.apache.tika.Tika.parseToString(Tika.java:642)
    at java_programs.PdfParse.main(PdfParse.java:22)
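
This NoSuchMethodError is typically an SLF4J version conflict rather than a Tika bug: commons-logging is bridged through jcl-over-slf4j, and the jcl-over-slf4j on the classpath was built against a different slf4j-api than the one actually loaded. A small check to see which jar the conflicting class really comes from (class name is made up):

// prints the jar that LocationAwareLogger was loaded from, which usually
// exposes the mismatched slf4j-api on the classpath
public class Slf4jCheck {
    public static void main(String[] args) {
        System.out.println(org.slf4j.spi.LocationAwareLogger.class
                .getProtectionDomain().getCodeSource().getLocation());
    }
}

Aligning slf4j-api and jcl-over-slf4j to the same version (or removing the duplicate copy the check reveals) should make the error disappear.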

Java replace text in file (.doc, .docx, .pdf, etc.)

I have a form where users can email files to other users. Currently I take the file and email it to the other user.

I would like to scan the uploaded files for contact info and conceal it before emailing the file to the other user.

So if the file had an email in it, it would be replaced with something like "email-concealed".

I found this question which uses the POI library to replace text.

I would like to also be able to do this for other file types like .pdf. Is this possible?

I understand that Tika can parse a wide range of files. But can I use it to replace simple text (i.e. keep the same file while replacing simple text)? Would I have to parse it back into whatever format it originally was? Is that possible?
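
Tika itself is read-only: it parses many formats into text and metadata, but it has no writers, so it cannot put modified text back into the original file. Replacing text needs a format-specific library per type, for example Apache POI for .docx. A minimal sketch for the .docx case, with illustrative file names and a deliberately naive email regex (PDF is considerably harder, since PDF stores positioned glyph runs rather than editable strings):

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ConcealEmails {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("upload.docx");
                XWPFDocument doc = new XWPFDocument(in)) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                for (XWPFRun run : p.getRuns()) {
                    String text = run.getText(0);
                    if (text != null) {
                        // an address split across two runs will not match;
                        // real use needs run-merging logic
                        run.setText(text.replaceAll(
                                "[\\w.+-]+@[\\w-]+\\.[\\w.]+", "email-concealed"), 0);
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("upload-concealed.docx")) {
                doc.write(out);
            }
        }
    }
}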

Thank you.

Error undefined field: "stream_size" on Solr

I am trying to use /update/extract with a field that references a database column in Solr 6.3, but it does not work and this error appears:

Status:

{
  "data": {
    "responseHeader": { "status": 400, "QTime": 8 },
    "error": {
      "metadata": [
        "error-class", "org.apache.solr.common.SolrException",
        "root-error-class", "org.apache.solr.common.SolrException"
      ],
      "msg": "undefined field: \"stream_size\"",
      "code": 400
    }
  },
  "status": 400,
  "config": {
    "method": "POST",
    "transformRequest": [null],
    "transformResponse": [null],
    "headers": {
      "Content-type": "application/json",
      "Accept": "application/json, text/plain, */*"
    },
    "data": "[]",
    "url": "/solr/TesteSisp/update%2Fextract",
    "params": {
      "wt": "json",
      "_": 1486132402860,
      "commitWithin": 1000,
      "boost": "1.0",
      "overwrite": true
    },
    "timeout": 10000
  },
  "statusText": "Bad Request"
}
Response:

{
  "responseHeader": {
    "status": 0,
    "QTime": 5
  }
}

Does anyone know what I can do?
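
stream_size is one of the metadata fields that Tika adds during /update/extract, and the 400 means that field is not defined in the schema. The usual workaround is either an explicit fmap.stream_size mapping or a blanket uprefix=ignored_ (together with an ignored_* dynamic field in the schema). A SolrJ sketch of the same request with uprefix set — the core name comes from the error above, everything else is assumed:

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithUprefix {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/TesteSisp").build();
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("report.pdf"), "application/pdf");
        req.setParam("literal.id", "971");
        // route Tika metadata fields missing from the schema to ignored_*
        req.setParam("uprefix", "ignored_");
        req.setParam("commitWithin", "1000");
        solr.request(req);
        solr.close();
    }
}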

Error when indexing file in Solr with Oracle database

I configured my data_config.xml this way:

<entity name="bop_anexo" processor="SqlEntityProcessor" query="SELECT ID_BOP_ANEXO,
                        ID_BOP_REFERENCIA,
                        NM_ANEXO,
                        TP_ANEXO,
                        DECODE64_CLOB3(REPLACE(ANEXO, 'data:application/pdf;base64,', '')) as ANEXO_CONVERTIDO,
                        ANEXO,
                        MINIATURA,
                        ID_SITUACAO,
                        DT_MANUTENCAO,
                        ID_USUARIO_MANUTENCAO
                FROM BOP_ANEXO WHERE TP_ANEXO = 'pdf'" transformer="ClobTransformer">
    <field column="ID_BOP_ANEXO" name="id"/>
    <field column="ID_BOP_REFERENCIA" name="id_bop_referencia"/>
    <field column="NM_ANEXO" name="nm_anexo"/>
    <field column="TP_ANEXO" name="tp_anexo"/>
    <field column="ANEXO_CONVERTIDO" name="anexo_convertido" clob="true"/> 
    <field column="ANEXO" name="anexo" clob="true"/> 
    <field column="ID_SITUACAO" name="id_situacao"/>
    <field column="DT_MANUTENCAO" name="dt_manutencao"/>
    <field column="ID_USUARIO_MANUTENCAO" name="id_usuario_manutencao"/>
</entity>

But when I try to execute the dataimport, this error appears:

org.apache.solr.common.SolrException: TransactionLog doesn't know how to serialize class oracle.sql.CLOB; try implementing ObjectResolver?
at org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)
at org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:206)
at org.apache.solr.common.util.JavaBinCodec.writeSolrInputDocument(JavaBinCodec.java:496)
at org.apache.solr.update.TransactionLog.write(TransactionLog.java:361)
at org.apache.solr.update.UpdateLog.add(UpdateLog.java:429)
at org.apache.solr.update.UpdateLog.add(UpdateLog.java:415)

And when I search with a Solr query, this result appears:

"responseHeader":{
"status":0,
"QTime":32,
"params":{
  "q":"*:*",
  "indent":"on",
  "wt":"json",
  "_":"1486041075119"}},
  "response":{"numFound":7,"start":0,"docs":[
      {
        "id_bop_referencia":"902",
        "miniatura":"[email protected]",
        "tp_anexo":"pdf",
        "anexo":"data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKN...",
        "anexo_convertido":"%PDF-1.4\n%âãÏÓ\n4 0 obj\n<</Type/XObject/ColorSpace/DeviceRGB/Subtype/Image/BitsPerComponent 8/Width 45/Length 4609/Height 48/Filter/DCTDecode>>stream\nÿØÿà\u0000\...",
        "id":"971",
        "nm_anexo":"report.pdf",
        "_version_":1557683947554471936},
          {

I have a base64 CLOB field type, and I converted it in the Oracle database with an SQL query, but Solr and Tika do not index the correct text, as shown above. Does anyone know what I can do?
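
A sketch of what is probably going wrong, not a confirmed diagnosis: decoding the base64 data URI into a CLOB only turns it into the raw PDF bytes as characters (the "%PDF-1.4..." string above), and that character soup is what gets indexed; Tika never sees a binary PDF stream. The decoded bytes have to be run through Tika before they become searchable text (inside DIH this is the job of something like TikaEntityProcessor over a binary stream). In plain Java, the missing extraction step looks roughly like this, with the data URI passed in as an argument:

import java.io.ByteArrayInputStream;
import java.util.Base64;

import org.apache.tika.Tika;

public class DecodeAndExtract {
    public static void main(String[] args) throws Exception {
        // args[0] is a full ANEXO value: "data:application/pdf;base64,JVBERi0..."
        String b64 = args[0].replace("data:application/pdf;base64,", "");
        byte[] pdfBytes = Base64.getDecoder().decode(b64);
        // Tika extracts the text layer from the real binary PDF
        String text = new Tika().parseToString(new ByteArrayInputStream(pdfBytes));
        System.out.println(text);
    }
}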