In [1]:
sf
2019/06/14 22:45:23 [FATAL] expecting one or more file or directory arguments (or '-' to scan stdin)

In [11]:
sf -sig deluxe.sig tika-app.jar
---
siegfried   : 1.7.11
scandate    : 2019-06-16T16:30:37Z
signature   : deluxe.sig
created     : 2019-02-16T11:10:03+01:00
identifiers : 
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V94.xml; container-signature-20180917.xml'
  - name    : 'tika'
    details : 'tika-mimetypes.xml (1.20, 2018-12-17)'
  - name    : 'freedesktop.org'
    details : 'freedesktop.org.xml (1.10, 2018-28-06)'
  - name    : 'loc'
    details : 'fddXML.zip (2019-01-06, DROID_SignatureFile_V94.xml, container-signature-20180917.xml)'
---
filename : 'tika-app.jar'
filesize : 75707056
modified : 2019-06-16T16:12:14Z
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/412'
    format  : 'Java Archive Format'
    version : 
    mime    : 'application/java-archive'
    basis   : 'extension match jar; container name META-INF/MANIFEST.MF with byte match at 0, 17'
    warning : 
  - ns      : 'tika'
    id      : 'application/java-archive'
    format  : 'Java Archive'
    mime    : 'application/java-archive'
    basis   : 'extension match jar; byte match at 0, 4 (signature 1/3)'
    warning : 
  - ns      : 'freedesktop.org'
    id      : 'application/x-java-archive'
    format  : 'Java archive'
    mime    : 'application/x-java-archive'
    basis   : 'extension match jar; byte match at 0, 4'
    warning : 
  - ns      : 'loc'
    id      : 'UNKNOWN'
    format  : 
    full    : 
    mime    : 
    basis   : 
    warning : 'no match'
In [10]:
sf -json -sig deluxe.sig tika-app.jar | python -m json.tool
{
    "siegfried": "1.7.11",
    "scandate": "2019-06-16T16:30:23Z",
    "signature": "deluxe.sig",
    "created": "2019-02-16T11:10:03+01:00",
    "identifiers": [
        {
            "name": "pronom",
            "details": "DROID_SignatureFile_V94.xml; container-signature-20180917.xml"
        },
        {
            "name": "tika",
            "details": "tika-mimetypes.xml (1.20, 2018-12-17)"
        },
        {
            "name": "freedesktop.org",
            "details": "freedesktop.org.xml (1.10, 2018-28-06)"
        },
        {
            "name": "loc",
            "details": "fddXML.zip (2019-01-06, DROID_SignatureFile_V94.xml, container-signature-20180917.xml)"
        }
    ],
    "files": [
        {
            "filename": "tika-app.jar",
            "filesize": 75707056,
            "modified": "2019-06-16T16:12:14Z",
            "errors": "",
            "matches": [
                {
                    "ns": "pronom",
                    "id": "x-fmt/412",
                    "format": "Java Archive Format",
                    "version": "",
                    "mime": "application/java-archive",
                    "basis": "extension match jar; container name META-INF/MANIFEST.MF with byte match at 0, 17",
                    "warning": ""
                },
                {
                    "ns": "tika",
                    "id": "application/java-archive",
                    "format": "Java Archive",
                    "mime": "application/java-archive",
                    "basis": "extension match jar; byte match at 0, 4 (signature 1/3)",
                    "warning": ""
                },
                {
                    "ns": "freedesktop.org",
                    "id": "application/x-java-archive",
                    "format": "Java archive",
                    "mime": "application/x-java-archive",
                    "basis": "extension match jar; byte match at 0, 4",
                    "warning": ""
                },
                {
                    "ns": "loc",
                    "id": "UNKNOWN",
                    "format": "",
                    "full": "",
                    "mime": "",
                    "basis": "",
                    "warning": "no match"
                }
            ]
        }
    ]
}
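The JSON report above can be reduced to one identifier per namespace for each file. The following is a minimal sketch, assuming the report keeps the layout shown above (`files[].matches[]` with `ns` and `id` keys). The helper name `summarise_matches` is hypothetical, and `SAMPLE` is a trimmed copy of the output above, not a new scan.

```python
import json

# Trimmed copy of the `sf -json` output shown above (two of the four matches).
SAMPLE = """
{
  "siegfried": "1.7.11",
  "files": [
    {
      "filename": "tika-app.jar",
      "filesize": 75707056,
      "errors": "",
      "matches": [
        {"ns": "pronom", "id": "x-fmt/412", "format": "Java Archive Format",
         "mime": "application/java-archive", "warning": ""},
        {"ns": "loc", "id": "UNKNOWN", "format": "", "mime": "",
         "warning": "no match"}
      ]
    }
  ]
}
"""


def summarise_matches(report):
    """Map each filename to {namespace: id}; None marks a namespace with no match."""
    summary = {}
    for entry in report.get("files", []):
        ids = {}
        for match in entry.get("matches", []):
            # In the output above, an unmatched namespace carries the id "UNKNOWN".
            ids[match["ns"]] = None if match["id"] == "UNKNOWN" else match["id"]
        summary[entry["filename"]] = ids
    return summary


for name, ids in summarise_matches(json.loads(SAMPLE)).items():
    print(name, ids)
```

In practice the report would probably come from the `sf -json` command shown above instead of an embedded string, for example by reading its standard output.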
In [26]:
ls -ld /usr/share/siegfried/*
-rw-rw-r-- 1 2000 2000 2393472 Feb 16 10:40 /usr/share/siegfried/DROID_SignatureFile_V94.xml
-rw-rw-r-- 1 2000 2000     195 Feb 16 10:40 /usr/share/siegfried/Install.txt
-rw-rw-r-- 1 2000 2000  130108 Feb 16 10:40 /usr/share/siegfried/archivematica.sig
-rw-rw-r-- 1 2000 2000  158869 Feb 16 10:40 /usr/share/siegfried/container-signature-20180917.xml
drwxrwxr-x 2 2000 2000    4096 Jun 14 22:40 /usr/share/siegfried/custom
-rw-rw-r-- 1 2000 2000  131942 Jun 15 22:58 /usr/share/siegfried/default.sig
-rw-rw-r-- 1 2000 2000  309192 Feb 16 10:40 /usr/share/siegfried/deluxe.sig
-rw-rw-r-- 1 2000 2000 1923318 Feb 16 10:40 /usr/share/siegfried/fddXML.zip
-rw-rw-r-- 1 2000 2000 2280389 Feb 16 10:40 /usr/share/siegfried/freedesktop.org.xml
-rw-rw-r-- 1 2000 2000   68174 Feb 16 10:40 /usr/share/siegfried/freedesktop.sig
-rw-rw-r-- 1 2000 2000   30608 Feb 16 10:40 /usr/share/siegfried/loc.sig
-rw-rw-r-- 1 2000 2000     115 Feb 16 10:40 /usr/share/siegfried/mime-info.json
drwxrwxr-x 2 2000 2000   61440 Jun 14 22:40 /usr/share/siegfried/pronom
-rw-rw-r-- 1 2000 2000  250477 Feb 16 10:40 /usr/share/siegfried/pronom-tika-loc.sig
-rw-rw-r-- 1 2000 2000  606791 Feb 16 10:40 /usr/share/siegfried/release-notes.xml
drwxrwxr-x 2 2000 2000    4096 Jun 14 22:40 /usr/share/siegfried/sets
-rw-rw-r-- 1 2000 2000  255194 Feb 16 10:40 /usr/share/siegfried/tika-mimetypes.xml
-rw-rw-r-- 1 2000 2000   94127 Feb 16 10:40 /usr/share/siegfried/tika.sig
In [22]:
sh ~/droid/droid.sh -Nr test-files/lorem-ipsum.txt -q -Ns /usr/share/siegfried/DROID_SignatureFile_V94.xml -Nc /usr/share/siegfried/container-signature-20180917.xml
12:59:11,353  INFO [main] ReflectionServiceFactoryBean:399 - Creating Service {http://pronom.nationalarchives.gov.uk}PronomServiceService from class uk.gov.nationalarchives.pronom.PronomService
/home/andy/test-files/lorem-ipsum.txt,Unknown
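Even with `-q`, the run above prints a logging line before the CSV row, so a consumer of this output may need to filter it. The sketch below is one possible approach, assuming that log lines always begin with a `HH:MM:SS,mmm LEVEL` prefix as in the output above, and that each remaining line is a CSV row whose first column is the file path. The names `LOG_LINE` and `parse_droid_rows` are hypothetical.

```python
import csv
import re

# Output copied from the cell above; the log line is shortened here for width.
SAMPLE = (
    "12:59:11,353  INFO [main] ReflectionServiceFactoryBean:399 - Creating Service\n"
    "/home/andy/test-files/lorem-ipsum.txt,Unknown\n"
)

# Assumption: log lines start with a timestamp and an upper-case level name.
LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s+[A-Z]+\s")


def parse_droid_rows(text):
    """Return (path, identifiers) tuples, skipping blank lines and log lines."""
    kept = [line for line in text.splitlines()
            if line and not LOG_LINE.match(line)]
    rows = []
    for row in csv.reader(kept):
        if len(row) >= 2:
            # The first column is the path; any remaining columns are kept as-is.
            rows.append((row[0], row[1:]))
    return rows


for path, identifiers in parse_droid_rows(SAMPLE):
    print(path, identifiers)
```

Using the `csv` module instead of a plain `split(",")` keeps paths that contain quoted commas intact.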
In [15]:
sh ~/droid/droid.sh -h
21:16:44,905  INFO [main] ReflectionServiceFactoryBean:399 - Creating Service {http://pronom.nationalarchives.gov.uk}PronomServiceService from class uk.gov.nationalarchives.pronom.PronomService
usage: droid [options]
OPTIONS:
  -c,--check-signature-update         Check whether signature updates are
                                      available for download.
  -d,--download-signature-update      Download the latest signature updates, if
                                      a newer version is available.
  -h,--help                           Display this help.  More help is
                                      available using the help menu in the
                                      graphical user interface.
  -l,--list-reports                   List the available reports and output
                                      formats.
  -s,--set-signature-file <version>   Set the current default binary signature
                                      file version.  For example:
                                      droid -s 42
  -v,--version                        Display the version of the DROID
                                      software.
  -x,--display-signature-file         Display the current default signature
                                      file.
  -X,--list-signature-files           List all locally available signature
                                      files.
  -Nr,--no-profile-resource <folder>  Identify either a specific file, or all
                                      files in a folder, without the use of a
                                      profile.  The file or folder path should
                                      be bounded by double quotes.  The scan
                                      results will be sent to standard output.
                                      For example: droid -Nr "C:\Files\A
                                      Folder"
                                      Note: You cannot use reporting, filtering
                                      and exporting when using the -Nr option.
     -A,--open-archives                  [optional] Open archive (zip, tar,
                                         gzip, rar, 7zip, bzip2, iso) files and
                                         identify all their contents.
     -Nc,--container-file <filename>     [optional] The container signature
                                         file to be used for identification.
                                         If omitted, container-format files may
                                         be identified by container type only.
     -Ns,--signature-file <filename>     Specify the signature file to be used
                                         for identification.
     -Nx,--extension-list <extensions>   [optional] Only identify files with
                                         the given extensions
                                         For example: -Nx csv jp2
     -q,--quiet                          [optional] When run in PROFILE mode
                                         DROID will limit its console output to
                                         errors only.  When run in NO PROFILE
                                         mode DROID will limit its output to
                                         CSV data only.
     -R,--recurse                        [optional] Recurse into all subfolders
                                         of any folder specified using the -a
                                         or -Nr options. Files in all
                                         sub-folders (and their sub-folders,
                                         and so on) will be processed as well.
                                         If this option is omitted and a folder
                                         is specified, only the files directly
                                         under the folder will be processed.
                                         For example:
                                         droid -R -a "C:\Files\Another Folder"
                                         -p "C:\Results\result3.droid"
     -W,--open-webarchives               [optional] Open ARC or WARC files and
                                         identify their contents
  -a,--profile-resources <resources>  Add resources to a new profile and run
                                      it.  Resources are the file path of any
                                      file or folder you want to profile.  The
                                      file paths should be given surrounded in
                                      double quotes, and separated by spaces
                                      from each other.  The profile results
                                      will be saved to a single file specified
                                      using the -p option.
                                      For example: droid -a "C:\Files\A Folder"
                                      "C:\Files\file.xxx" -p
                                      "C:\Results\result1.droid"
                                      Note: You cannot use reporting, filtering
                                      and exporting when using the -a option.
     -p,--profile(s) <filename(s)>       When used in conjunction with
                                         reporting, filtering or exporting, -p
                                         specifies a list of profiles to open.
                                         The file paths of the profiles should
                                         be bounded by double quotes, and
                                         separated by spaces from each other.
                                         When used in conjunction with the -a
                                         option, the results of the new profile
                                         will be saved to that file, and you
                                         can only specify a single file.
                                         For example: droid -p
                                         "C:\Results\result1.droid"
                                         "C:\Results\result2.droid" -e
                                         "C:\ExportscombinedResults.csv"
                                         droid -a "C:\Files\A Folder"
                                         "C:\Files\file.xxx" -p
                                         "C:\Results\result1.droid"
     -q,--quiet                          [optional] When run in PROFILE mode
                                         DROID will limit its console output to
                                         errors only.  When run in NO PROFILE
                                         mode DROID will limit its output to
                                         CSV data only.
     -R,--recurse                        [optional] Recurse into all subfolders
                                         of any folder specified using the -a
                                         or -Nr options. Files in all
                                         sub-folders (and their sub-folders,
                                         and so on) will be processed as well.
                                         If this option is omitted and a folder
                                         is specified, only the files directly
                                         under the folder will be processed.
                                         For example:
                                         droid -R -a "C:\Files\Another Folder"
                                         -p "C:\Results\result3.droid"
  -e,--export-file <filename>         Export profiles to a CSV file with one
                                      row per profiled file.  If any filters
                                      are specified, then they will apply to
                                      the exported file.
                                      For example: droid -p
                                      "C:\Results\result1.droid"
                                      "C:\Results\result2.droid" -e
                                      "C:\ExportscombinedResults.csv"
                                      droid -p "C:\Results\result3.droid" -f
                                      "PUID any_of fmt/111 fmt/112" -e
                                      "C:\Exports\filteredResults.csv"
     -B,--bom                            Save file with BOM - Byte order mark.
     -F,--filter-any <filter ...>        [optional] Filter profiles as the -f
                                         option does, except results which
                                         match ANY of the specified filter
                                         criteria will appear.
     -f,--filter-all <filter ...>        [optional] Filter the profiles
                                         specified using the -p option.  Only
                                         results which match ALL filter
                                         criteria specified will appear.
                                         Filter criteria are specified using
                                         the following method:
                                         <field><operator><values> where
                                         <field> is the name of a filterable
                                         field, <operator> is the type of
                                         comparison to use, and <values> are
                                         the value or values against which the
                                         field value should be compared.  The
                                         -k option provides information on the
                                         available fields and operators.  You
                                         can specify more than one filter
                                         criteria, bounded by double quotes and
                                         separated by spaces from each other.
                                         For example:
                                         droid -p "C:\Results\result3.droid"
                                         -f "PUID any_of fmt/111 fmt/112"
                                         -e "C:\Exports\filteredResults.csv"
                                         droid -p "C:\Results\result1.droid"
                                         "C:\Results\result2.droid"
                                         -f "file_size > 0"
                                         -e "C:\Exports\filteredResults.csv"
     -p,--profile(s) <filename(s)>       When used in conjunction with
                                         reporting, filtering or exporting, -p
                                         specifies a list of profiles to open.
                                         The file paths of the profiles should
                                         be bounded by double quotes, and
                                         separated by spaces from each other.
                                         When used in conjunction with the -a
                                         option, the results of the new profile
                                         will be saved to that file, and you
                                         can only specify a single file.
                                         For example: droid -p
                                         "C:\Results\result1.droid"
                                         "C:\Results\result2.droid" -e
                                         "C:\ExportscombinedResults.csv"
                                         droid -a "C:\Files\A Folder"
                                         "C:\Files\file.xxx" -p
                                         "C:\Results\result1.droid"
  -E,--export-format <filename>       Export profiles to a CSV file with one
                                      row per profiled file/format.  If any
                                      filters are specified, then they will
                                      apply to the exported file.
                                      For example: droid -p
                                      "C:\Results\result1.droid"
                                      "C:\Results\result2.droid" -E
                                      "C:\ExportscombinedResults.csv"
                                      droid -p "C:\Results\result3.droid" -f
                                      "PUID any_of fmt/111 fmt/112" -E
                                      "C:\Exports\filteredResults.csv"
     -B,--bom                            Save file with BOM - Byte order mark.
     -F,--filter-any <filter ...>        [optional] Filter profiles as the -f
                                         option does, except results which
                                         match ANY of the specified filter
                                         criteria will appear.
     -f,--filter-all <filter ...>        [optional] Filter the profiles
                                         specified using the -p option.  Only
                                         results which match ALL filter
                                         criteria specified will appear.
                                         Filter criteria are specified using
                                         the following method:
                                         <field><operator><values> where
                                         <field> is the name of a filterable
                                         field, <operator> is the type of
                                         comparison to use, and <values> are
                                         the value or values against which the
                                         field value should be compared.  The
                                         -k option provides information on the
                                         available fields and operators.  You
                                         can specify more than one filter
                                         criteria, bounded by double quotes and
                                         separated by spaces from each other.
                                         For example:
                                         droid -p "C:\Results\result3.droid"
                                         -f "PUID any_of fmt/111 fmt/112"
                                         -e "C:\Exports\filteredResults.csv"
                                         droid -p "C:\Results\result1.droid"
                                         "C:\Results\result2.droid"
                                         -f "file_size > 0"
                                         -e "C:\Exports\filteredResults.csv"
     -p,--profile(s) <filename(s)>       When used in conjunction with
                                         reporting, filtering or exporting, -p
                                         specifies a list of profiles to open.
                                         The file paths of the profiles should
                                         be bounded by double quotes, and
                                         separated by spaces from each other.
                                         When used in conjunction with the -a
                                         option, the results of the new profile
                                         will be saved to that file, and you
                                         can only specify a single file.
                                         For example: droid -p
                                         "C:\Results\result1.droid"
                                         "C:\Results\result2.droid" -e
                                         "C:\ExportscombinedResults.csv"
                                         droid -a "C:\Files\A Folder"
                                         "C:\Files\file.xxx" -p
                                         "C:\Results\result1.droid"
  -r,--report <filename>              Save the report generated to the file
                                      specified.  For example:
                                      droid -p "C:\Results\result1.droid" -n
                                      "PLANETS" -r
                                      "C:\Reports\result1Report.xml"
     -n,--report-name <report name>      Run the report with the specified name
                                         on any profiles opened using the -p
                                         option.  For example:
                                         droid -p "C:Results\result1.droid" -n
                                         "PLANETS" -r
                                         "C:\Reports\result1Report.xml"
     -p,--profile(s) <filename(s)>       When used in conjunction with
                                         reporting, filtering or exporting, -p
                                         specifies a list of profiles to open.
                                         The file paths of the profiles should
                                         be bounded by double quotes, and
                                         separated by spaces from each other.
                                         When used in conjunction with the -a
                                         option, the results of the new profile
                                         will be saved to that file, and you
                                         can only specify a single file.
                                         For example: droid -p
                                         "C:\Results\result1.droid"
                                         "C:\Results\result2.droid" -e
                                         "C:\ExportscombinedResults.csv"
                                         droid -a "C:\Files\A Folder"
                                         "C:\Files\file.xxx" -p
                                         "C:\Results\result1.droid"
     -t,--report-type <report type>      Set the output file format of a
                                         report.
In [25]:
java -jar tika-app.jar --help
usage: java -jar tika-app.jar [option...] [file|port...]

Options:
    -?  or --help          Print this usage message
    -v  or --verbose       Print debug level messages
    -V  or --version       Print the Apache Tika version number

    -g  or --gui           Start the Apache Tika GUI
    -s  or --server        Start the Apache Tika server
    -f  or --fork          Use Fork Mode for out-of-process extraction

    --config=<tika-config.xml>
        TikaConfig file. Must be specified before -g, -s, -f or the dump-x-config !
    --dump-minimal-config  Print minimal TikaConfig
    --dump-current-config  Print current TikaConfig
    --dump-static-config   Print static config
    --dump-static-full-config  Print static explicit config

    -x  or --xml           Output XHTML content (default)
    -h  or --html          Output HTML content
    -t  or --text          Output plain text content
    -T  or --text-main     Output plain text content (main content only)
    -m  or --metadata      Output only metadata
    -j  or --json          Output metadata in JSON
    -y  or --xmp           Output metadata in XMP
    -J  or --jsonRecursive Output metadata and content from all
                           embedded files (choose content type
                           with -x, -h, -t or -m; default is -x)
    -l  or --language      Output only language
    -d  or --detect        Detect document type
           --digest=X      Include digest X (md2, md5, sha1,
                               sha256, sha384, sha512
    -eX or --encoding=X    Use output encoding X
    -pX or --password=X    Use document password X
    -z  or --extract       Extract all attachements into current directory
    --extract-dir=<dir>    Specify target directory for -z
    -r  or --pretty-print  For JSON, XML and XHTML outputs, adds newlines and
                           whitespace, for better readability

    --list-parsers
         List the available document parsers
    --list-parser-details
         List the available document parsers and their supported mime types
    --list-parser-details-apt
         List the available document parsers and their supported mime types in apt format.
    --list-detectors
         List the available document detectors
    --list-met-models
         List the available metadata models, and their supported keys
    --list-supported-types
         List all known media types and related information


    --compare-file-magic=<dir>
         Compares Tika's known media types to the File(1) tool's magic directory
Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed. If no arguments were given and no input
    data is available, the GUI is started instead.

- GUI mode

    Use the "--gui" (or "-g") option to start the
    Apache Tika GUI. You can drag and drop files from
    a normal file explorer to the GUI window to extract
    text content and metadata from the files.

- Server mode

    Use the "--server" (or "-s") option to start the
    Apache Tika server. The server will listen to the
    ports you specify as one or more arguments.

- Batch mode

    Simplest method.
    Specify two directories as args with no other args:
         java -jar tika-app.jar <inputDirectory> <outputDirectory>

Batch Options:
    -i  or --inputDir          Input directory
    -o  or --outputDir         Output directory
    -numConsumers              Number of processing threads
    -bc                        Batch config file
    -maxRestarts               Maximum number of times the 
                               watchdog process will restart the child process.
    -timeoutThresholdMillis    Number of milliseconds allowed to a parse
                               before the process is killed and restarted
    -fileList                  List of files to process, with
                               paths relative to the input directory
    -includeFilePat            Regular expression to determine which
                               files to process, e.g. "(?i)\.pdf"
    -excludeFilePat            Regular expression to determine which
                               files to avoid processing, e.g. "(?i)\.pdf"
    -maxFileSizeBytes          Skip files longer than this value

    Control the type of output with -x, -h, -t and/or -J.

    To modify child process jvm args, prepend "J" as in:
    -JXmx4g or -JDlog4j.configuration=file:log4j.xml.