Doc Processor
Like the others, this script takes anything that OpenOffice can read and turns it into animated GIFs.
- Creates a series of web pages that contain thumbnails of all readable documents
- Gathers details about the files, such as Exif data
- Can gather any other data you can think of through plugins
File Types That Should Work With The Script (Source: HTTP://wiki.services.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/File_formats)
- Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
- Microsoft Word 2003 XML (.xml)
- Microsoft Word 2007 XML (.docx, .docm, .dotx, .dotm)
- Microsoft WinWord 5 (.doc)
- WordPerfect Document (.wpd)
- WPS 2000/Office 1.0 (.wps)
- .rtf, .txt, and .csv
- StarWriter formats (.sdw, .sgl, .vor)
- DocBook (.xml)
- Unified Office Format text (.uot, .uof)
- Ichitaro 8/9/10/11 (.jtd and .jtt)
- Hangul WP 97 (.hwp)
- T602 Document (.602, .txt)
- AportisDoc (Palm) (.pdb)
- Pocket Word (.psw)
- Microsoft Excel 97/2000/XP (.xls, .xlw, and .xlt)
- Microsoft Excel 4.x–5.0/95 (.xls, .xlw, and .xlt)
- Microsoft Excel 2003 XML (.xml)
- Microsoft Excel 2007 XML (.xlsx, .xlsm, .xltx, .xltm)
- Microsoft Excel 2007 binary (.xlsb)
- Lotus 1-2-3 (.wk1, .wks, and .123)
- Data Interchange Format (.dif)
- Rich Text Format (.rtf)
- Text CSV (.csv and .txt)
- StarCalc formats (.sdc and .vor)
- dBASE (.dbf)
- SYLK (.slk)
- Unified Office Format spreadsheet (.uos, .uof)
- .htm and .html files, including Web page queries
- Pocket Excel (.pxl)
- Quattro Pro 6.0 (.wb2)
- Microsoft PowerPoint 97/2000/XP (.ppt, .pps, and .pot)
- Microsoft PowerPoint 2007 (.pptx, .pptm, .potx, .potm)
- StarDraw and StarImpress (.sda, .sdd, .sdp, and .vor)
- Unified Office Format presentation (.uop, .uof)
- CGM – Computer Graphics Metafile (.cgm)
- Portable Document Format (.pdf)
- Oh, and any OpenOffice documents :)
Requirements
Perl modules: Getopt::Long, Pod::Usage, File::Basename, Config::IniFiles, OLE::Storage, Unicode::Map, Startup, Image::ExifTool, Digest::MD5, Digest::SHA, OLE::PropertySet, Getopt::Std
Libraries and packages installed: ImageMagick, Ghostscript, unoconv
Unoconv can be obtained at: http://dag.wieers.com/home-made/unoconv/
Standard Plugins
- exif.pl: Uses Exif to dump whatever metadata it can find in the file.
- md5.pl: Calculates the MD5 hash for the file.
- sha.pl: Calculates the SHA-512 hash for the file.
- WMD.pl: A Perl script written by Mr. Harlan Carvey for dumping metadata from Word documents.
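To get a feel for what the two hash plugins report, the same digests can be produced from the shell. This is only a sketch and not the plugins' own code: it assumes the GNU coreutils tools md5sum and sha512sum are installed, and the function name file_hashes and the output labels are made up for illustration.

```shell
#!/bin/sh
# Print the MD5 and SHA-512 digests of one file, one labelled line each.
# file_hashes is a hypothetical helper, not part of the data carver scripts.
file_hashes() {
    # md5sum/sha512sum print "<digest>  <filename>"; keep only the digest.
    md5=$(md5sum "$1" | cut -d ' ' -f 1)
    sha512=$(sha512sum "$1" | cut -d ' ' -f 1)
    echo "MD5: $md5"
    echo "SHA-512: $sha512"
}

# Demo on a throwaway file that contains the three bytes "abc".
tmp=$(mktemp)
printf 'abc' > "$tmp"
file_hashes "$tmp"
rm -f "$tmp"
```

Comparing these values with the ones shown on the generated pages is a quick way to confirm that a recovered file has not changed since it was processed.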
Installation
- Install OpenOffice
- Install the listed Perl modules
- Install the other binary requirements: ImageMagick, Ghostscript, and unoconv. If you’re running Fedora, all three can be installed via yum.
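Before the first run it can help to confirm that the external tools are actually on your PATH. The sketch below is one way to do that. The helper name check_tools is hypothetical, and the executable names convert (ImageMagick), gs (Ghostscript), and unoconv are assumptions about how your distribution installs these packages.

```shell
#!/bin/sh
# Report which of the named commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and returns
# non-zero if at least one is absent.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}

# Assumed executable names; adjust them if your system differs.
check_tools convert gs unoconv || echo "Install the missing tools before running the script."
```

A missing Perl module can be spotted in a similar way, for example with perl -MDigest::MD5 -e 1, which exits with an error if the module cannot be loaded.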
INI File
The INI file (data_processor.ini) contains the user-configurable options for each of the data processor scripts.
Each parameter is preceded by a comment that describes it. See the INI file for more details.
Running The Program
Command-line example: ./docs-processor.pl --inputdir /export/data_carver_processors/doc_exam --output doc-index --plugindir /export/data_carver_processors/docs-plugins --ini /export/data_carver_processors/data_processor.ini
After the program has gone through the documents, bring up your favorite web browser and open the file you named with the --output option. In the above case, I would open doc-index.html in the directory from which I ran docs-processor.pl.
Options
| Option | Description |
| --- | --- |
| --ini FILE | INI (configuration) file |
| --title TITLE | Use this title as the page heading |
| --inputdir DIR | Input directory |
| --output FILE | Use this name for the output file instead of “index.html” |
| --plugindir DIR | Plugin directory |
| --imagenum NUMBER | Number of thumbnails per page; default is 2000 |
| --perrow NUMBER | Number of thumbnails per row; default is 4 |
| --imagesize NUMBER | Size of the thumbnails in pixels; default is 150 pixels |
| --quality 0..100 | Quality of the thumbnails from 0 to 100; default is 80 |
| --help or --man | Show this help text and exit |
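The thumbnails-per-page and thumbnails-per-row options together determine how large the output gets. The arithmetic below is a rough sketch, assuming the script simply fills each page with up to the per-page limit before starting the next one. The document count is a made-up figure and ceil_div is a hypothetical helper.

```shell
#!/bin/sh
# Ceiling division for positive integers: the smallest whole number of
# groups of size $2 needed to hold $1 items.
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

docs=4500       # hypothetical number of readable documents
imagenum=2000   # thumbnails per page (documented default)
perrow=4        # thumbnails per row (documented default)

pages=$(ceil_div "$docs" "$imagenum")
rows_full_page=$(ceil_div "$imagenum" "$perrow")
echo "pages: $pages"
echo "rows on a full page: $rows_full_page"
```

If pages with that many rows load too slowly in your browser, lowering the thumbnails-per-page value trades longer pages for a larger number of shorter ones.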
Other Notes
Feedback: Please send me an email with any features or plug-ins you would like to see. If you find any errors in the scripts, let me know. I am also interested in any plug-ins you want to share. If you like the program, let me know, too. I don’t mind positive feedback.
Errors: As the script runs over the files, you may see some error output. These errors come from the programs that are run on the recovered files. Not all of the files that data carvers recover are intact, hence the errors.
License: GPL 2.0
Download at: data_carver_processors.tar.gz (All of the data carver processor scripts are included in this file)
Contact: cs[at]citadelsystems.net