stripping logos from scanned PDF files

Moderator: kcleung

kcleung
Copyright Reviewer
Posts: 127
Joined: Fri Sep 12, 2008 9:38 pm

Post by kcleung »

Carolus wrote:
I wonder if we should consider setting up an FTP server where unlocked score files with logos present could be stored for processing. That way, with several people working together, a considerable number of titles could be processed using some of the methods outlined above and ultimately added to the collection.

That's a great idea! We should definitely arrange FTP servers as soon as possible so that multiple people can work on these collections! This would also *lower* the threshold for potential contributors.

They would *not* need access to a scanner, a printer or a score collection. All they need is internet access (preferably broadband) and a computer that is under four years old.


Also, this way the money originally planned for buying the OM series ourselves can be better spent on setting up the server and other required infrastructure.
Yagan Kiely
Site Admin
Posts: 1139
Joined: Sun Jan 14, 2007 8:16 am
notabot: YES
notabot2: Bot
Location: Perth, Australia
Contact:

Post by Yagan Kiely »

The PDFs on the CDs are password-protected. Is that a problem for anyone else?

What guide would you suggest for a Mac user?
ras1
active poster
Posts: 164
Joined: Thu Jul 26, 2007 8:28 pm

Post by ras1 »

Open with Preview, then do File->Print. Select PDF->Save as PDF. This unlocked it for me on Mac OS X 10.4.

Post by Yagan Kiely »

Yes, I did that too... very slow however...
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm
notabot: 42
notabot2: Human
Contact:

Post by Carolus »

The PDF locking can be easily "picked." That much is an entirely 'automated' process with the correct software (I use PDF Key Pro on my Mac). I've already asked Feldmahler about setting up an FTP site for this purpose. The files I will be uploading will already be unlocked anyway.

The processing as described by Daphnis would take care of the embedded meta-tags, etc. in addition to the more obvious stripping of logos and trademarks. It would really be great if a similar process could be developed for the Google scans, of which there are quite a few available. The Google items have been done with a very bizarre process. Most of the pages are a nice 600dpi monochrome, but every so often a single system or sometimes a half-page appears in 150dpi grayscale as a separate graphic on the page.

Google was embedding their logo as a watermark on some of the scores (but not the later ones, it appears). Microsoft is even worse on the scans they've done for the New York Public Library. As it stands right now, the logos, extensive meta-tags, etc. embedded in their scans render them unsuitable for IMSLP.
Lyle Neff
active poster
Posts: 702
Joined: Wed Mar 14, 2007 3:21 pm
notabot: 42
notabot2: Human
Location: Delaware, USA
Contact:

Post by Lyle Neff »

Can you remove meta-tags in Acrobat 6.0? If so, how?

Post by kcleung »

Just use the method I mentioned a bit earlier in this thread (you also need to install Ghostscript for Windows) and it will only take the necessary bits and leave all the metadata behind.

The password only prevents you from changing the data (and reading metadata) but *not* printing :)

I tried one of the files on the CD and it worked. It only took me 8 minutes to strip a 50-page document.

Post by ras1 »

Let me know if you're planning on setting up a server to remove logos - I have the Ravel/Elgar/etc. Violin one and no time to do it myself.

Post by kcleung »

We should urgently talk to Feldmahler (or others in the central admin) about setting up an FTP server for all PDF files "infected" with logos. Since this server will target CDSM, which releases items that are PD in the USA, perhaps the server should also be in the USA. Contributors could then upload infected files or check out entries to strip the logos.

Post by Lyle Neff »

kcleung wrote:
Just use the method I mentioned a bit earlier in this thread (you also need to install Ghostscript for Windows) and it will only take the necessary bits and leave all the metadata behind.
The password only prevents you from changing the data (and reading metadata) but *not* printing :)
I tried one of the files on the CD and it worked. It only took me 8 minutes to strip a 50-page document.

Could you briefly list the steps all in one place? (Ghostscript is installed on my computer, but I haven't used it myself in years.)

Stripping logos off PDF files on Linux, Mac and possibly Windows

Post by kcleung »

Requirements:

Ghostscript
an image editing tool (e.g. GIMP)
tiffcp: http://www.stillhq.com/pngtools/
tiff2pdf: http://www.libtiff.org/tools.html

Under Linux, all of the software mentioned above is pre-packaged, and I believe it should be readily available on the Mac. You may have to compile the tools yourself on Windows, though.

For batch jobs like this, the most cost-effective way is to run Linux!

Steps:

1. Put each parts file in its own subdirectory and change to the subdirectory of one of the parts

2. run as one line:

gs -sDEVICE=tiffg4 -dNOPAUSE -r300 -dBATCH -sPAPERSIZE=a4 -sOutputFile=output_%04d.tiff foo.pdf

(foo stands for the file name.) This opens the PDF file and performs all the necessary conversions (from PDF to 300 dpi, 1-bit black-and-white, A4-sized TIFF images, one per page).

3. Run GIMP. In the "Open" window, go to the subdirectory, highlight 5-10 TIFF files at a time and click "Open". This reduces the number of mouse actions required (the limiting factor for processing speed).

4. Erase the logo with the eraser and close the file. Make sure you click "Save" when it asks whether you want to save.

You only have to set up the eraser once. From then on, it takes *three clicks* (including the eraser action) to process each page, which brings the processing time per page down to about 5 seconds! :)

5. At the command prompt, change to the subdirectory and concatenate all processed images by running (matching only output_*.tiff keeps an output.tiff left over from an earlier run out of the input list):
tiffcp -c g4 output_*.tiff output.tiff

6. Finally, convert output.tiff back to PDF and write the PDF file to the parent directory:
tiff2pdf output.tiff > ../foo.pdf
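
For anyone who wants to script the non-interactive parts, steps 2, 5 and 6 above could be wrapped roughly as follows. This is only a sketch under a few assumptions: the function names pdf_to_tiffs and tiffs_to_pdf, the run helper and the DRY_RUN switch are made up for illustration, and gs, tiffcp and tiff2pdf are assumed to be on the PATH. It uses tiff2pdf's -o option rather than shell redirection so that the command can be printed in a dry run.

```shell
#!/bin/sh
# Sketch of steps 2, 5 and 6 for one part. The function names and the
# DRY_RUN switch are assumptions of this sketch, not part of the recipe.

# Run a command, or only print it when DRY_RUN=1 (useful for checking first).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 2: write one 300 dpi, 1-bit, G4-compressed TIFF per page of the PDF.
pdf_to_tiffs() {
    run gs -sDEVICE=tiffg4 -dNOPAUSE -dBATCH -r300 -sPAPERSIZE=a4 \
        -sOutputFile=output_%04d.tiff "$1"
}

# Steps 5 and 6: run this only after the logos have been erased in GIMP.
# Globbing output_*.tiff keeps an earlier output.tiff out of the input list.
tiffs_to_pdf() {
    run tiffcp -c g4 output_*.tiff output.tiff &&
    run tiff2pdf -o "../$1" output.tiff
}
```

To try it, source the file, change to a part's subdirectory and run "pdf_to_tiffs foo.pdf", erase the logos in GIMP as in steps 3 and 4, then run "tiffs_to_pdf foo.pdf". Setting DRY_RUN=1 beforehand prints the commands instead of running them.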

Post by Yagan Kiely »

kcleung wrote:
4. Erase the logo with the eraser and close the file, make sure you click "save" when it asks you whether you want to save.
You only have to set up the eraser once and from now on, it takes *three clicks* (including the eraser action) to process each page, thus decreasing processing time of each file to 5 seconds! :)

I only got GIMP 5 days ago. How would I do this?

Post by kcleung »

Yagan Kiely wrote:
I only got GIMP 5 days ago. How would I do this?

Assuming you use GIMP 2.6, first open the files as described previously. Once they are open, go to the toolbox pane; the eraser is in the left column, fifth from the top. Click the eraser, set the brush to the right size, and you can start erasing stuff on the image! If you make a mistake, go to Edit->Undo in the menu of the image window and you are out of trouble!

When you are happy with the image, close it. It will ask you whether to save; click "yes".

Post by Yagan Kiely »

Ooh, I thought there was a way to record the keystrokes and mouse clicks, so that you only do something once and then apply it to every image.

I know how to use the eraser! I'm not that hopeless! Honestly!...

Post by kcleung »

Yagan Kiely wrote:
Ooh, I thought there was a way to record the keystrokes and mouse clicks, so that you only do something once and then apply it to every image.

The trouble is that logos can be in different positions on different pages, so you really can't automate the eraser action. However, once you have selected the eraser and set the brush size, GIMP remembers these settings, and for each page you just do

erase -> close -> save

three clicks :)