Program to strip CDSM logos

Advice and Help

Moderator: kcleung

Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Program to strip CDSM logos

Post by Mazin »

A colleague and I cooked up a program to strip CDSM logos only from the CDSM scans.
Written in C++, it can detect and white-out logos of two different sizes: the standard logo and a slightly smaller version. It's decently fast. Using a quad-core processor, I was able to process 61 pages in 23 secs.
Source code is provided under the GPL. Download from
http://files.aztekera.com/music/magic20090807.zip

Compare
ftp://imslp.org/logo_infested/CDSM/Cell ... 0cello.pdf

to the automated result
http://files.aztekera.com/music/bach.tiff

But what's that you say? My program can't even glob filenames? That's why I also cooked up parallelize.pl, a Perl script that can parallelize any terminal command!

parallelize.pl

Code: Select all

#!/usr/bin/perl
#
# I, Eric Jiang, hereby release this script into the public domain.
#
# This script requires the Parallel::ForkManager module to run.
#
use strict;
use warnings;
use Parallel::ForkManager;

if (scalar(@ARGV) < 3) {
	print <<USAGE;
USAGE: perl parallelize.pl num cmd glob

"GLOB" is substituted with the name of each file

Example:
	parallelize.pl 4 "mogrify -negate GLOB" "lol*.pbm"
USAGE
	exit;
}

# Expand the glob pattern into the list of files to process.
my @globs = glob($ARGV[2]);
# Run at most $ARGV[0] child processes at a time.
my $pm = Parallel::ForkManager->new($ARGV[0]);

for my $currfile (@globs) {
	# The parent forks a child here and moves on to the next file.
	$pm->start and next;

	# Child: put the file name in place of every GLOB and run the command.
	my $command = $ARGV[1];
	$command =~ s/GLOB/$currfile/g;

	print($command . "\n");
	system($command);

	$pm->finish;
}
$pm->wait_all_children;
print "Finished " . scalar(@globs) . " tasks.\n";
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

Oh, and some usage notes for my and others' future reference.

Given cdsmrm and parallelize.pl in $PATH, the workflow is something like:

Extract images from PDF: (caution: some PDFs are not so simple, such as where CDSM retypeset pages!)

Code: Select all

pdfimages BestSongEver.pdf bse
This extracts all images to bse-000.pbm, bse-001.pbm, bse-002.pbm, etc. Then remove the CDSM logo from each page, running six jobs in parallel on a six-core machine:

Code: Select all

parallelize.pl 6 "cdsmrm GLOB" "bse-*.pbm"
Convert images to TIFF in parallel, and compress in the process:

Code: Select all

parallelize.pl 6 "mogrify -format tiff -compress Group4 GLOB" "bse-*.pbm"
Compile the TIFFs into one TIFF:

Code: Select all

tiffcp bse-*.tiff BestSongEver_cleaned.tiff
Optionally convert to PDF:

Code: Select all

tiff2pdf BestSongEver_cleaned.tiff > BestSongEver_cleaned.pdf
Notice that there's no human intervention (although I suggest proofreading!), so theoretically, you could put all this in a batch script and run the batch script against a glob, say, *.pdf, come back in the morning, and proofread to your heart's content.
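
A minimal sketch of such a batch script, assuming all the tools above are in $PATH and that mogrify names its output bse-000.tiff and so on. The function name clean_pdf and the JOBS variable are made up for this example, and file names containing spaces are not handled, since parallelize.pl pastes each name into the command unquoted.

```shell
#!/bin/sh
# Sketch only: runs the steps from this post on every PDF given as an argument.
# clean_pdf and JOBS are hypothetical names, not part of cdsmrm or parallelize.pl.

JOBS=${JOBS:-6}    # number of parallel jobs, 6 unless set in the environment

clean_pdf() {
    pdf=$1
    base=${pdf%.pdf}    # BestSongEver.pdf -> BestSongEver, used as image prefix

    # Each step runs only if the previous one reported success.
    pdfimages "$pdf" "$base" &&
    parallelize.pl "$JOBS" "cdsmrm GLOB" "${base}-*.pbm" &&
    parallelize.pl "$JOBS" "mogrify -format tiff -compress Group4 GLOB" "${base}-*.pbm" &&
    tiffcp "${base}"-*.tiff "${base}_cleaned.tiff" &&
    tiff2pdf "${base}_cleaned.tiff" > "${base}_cleaned.pdf"
}

for pdf in "$@"; do
    clean_pdf "$pdf" || echo "failed: $pdf" >&2
done
```

Saved as, say, cleanall.sh (also a made-up name), it would be started with `sh cleanall.sh *.pdf`. The intermediate PBM and TIFF files are left in place so the pages can still be proofread.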

BTW, there's no guarantee that my program won't detect the logo to be in the center of the page and create a large white rectangle in the middle of a system. Hasn't happened to me yet, but just sayin'...
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

Booted into Windows, so now I have a win32 build:
http://files.aztekera.com/music/cdsmrm20090807win32.zip

Should run without any additional magic. However, you cannot drag'n'drop multiple files onto the exe on account of my failure to include globbing. For globbing, you can use parallelize.pl from my first post, or use the non-parallel cdsmrm.bat (put it in the same directory as cdsmrm.exe, and you can drag'n'drop multiple PBMs onto it):

cdsmrm.bat

Code: Select all

for %%i in (%*) do cdsmrm.exe %%i
daphnis
Copyright Reviewer
Posts: 1634
Joined: Thu May 17, 2007 7:15 pm

Re: Program to strip CDSM logos

Post by daphnis »

Can you describe (in prose) how its detection method works?
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

daphnis wrote: Can you describe (in prose) how its detection method works?
Naively. It looks for a hardcoded linear pattern that should only match the logo. Since the logos were added digitally after the pages were scanned, they should look the same pixel for pixel on every page, so it works (as far as I can tell).
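
To illustrate the idea only (this is not the actual cdsmrm code, which is written in C++ and whose pattern is not given in this thread): treat each pixel row as a string of 0s and 1s and look for one fixed run of pixels. The function name find_pattern, the input format and the example pattern are all assumptions made for this sketch.

```shell
# find_pattern PATTERN
# Reads one pixel row per line on stdin (1 = black, 0 = white) and prints
# "row column" (both counted from 0) of the first place PATTERN occurs.
# Returns 1 if the pattern is not found in any row.
find_pattern() {
    pattern=$1
    y=0
    while IFS= read -r row; do
        case $row in
            *"$pattern"*)
                before=${row%%"$pattern"*}    # everything left of the first match
                echo "$y ${#before}"
                return 0
                ;;
        esac
        y=$((y + 1))
    done
    return 1
}
```

For example, `printf '0000000\n0011010\n' | find_pattern 1101` reports a match in row 1 starting at column 2. A real detector would then white-out a logo-sized rectangle positioned relative to that match.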
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Re: Program to strip CDSM logos

Post by Carolus »

Outstanding. Have you done any complete scores of those resident on the FTP server?
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

Blah. Rewrote the search algorithm to be less dumb, because apparently I wasn't paying attention in my algorithms course or we didn't cover string searching. It's now three times faster.

New version is at http://files.aztekera.com/music/magic20090809.zip
Carolus wrote: Outstanding. Have you done any complete scores of those resident on the FTP server?
Only two right now (one is linked to in my first post), since I'm still testing. Umm... bursting and then creating a new PDF loses all of the bookmarks and metadata. What should I do about those? Is there a way to export bookmarks and metadata from the original PDF and then import them into the new PDF?
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

And what's the system for keeping track of which ones need to be processed and which ones are cleaned? Is there a "cleaned" directory I need to copy/move them to or something like that?
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

So using a script that basically does the procedure outlined in my second post, my computer processed the contents of "CelloCD Works" (about 135 PDFs... there is going to be a very happy cellist soon :lol:) in about 30 minutes. However, it's not problem-free. Once in a while, a lone page (usually the first) is exported as a PPM instead of a PBM and is thus left out of the cleaned PDF. This happened two or three times, but the presence of a PPM file is a dead giveaway. A bigger issue is in cases where CDSM retypeset entire pages to put the cello part by itself instead of with piano:

ftp://imslp.org/logo_infested/CDSM/Cell ... 0piano.pdf
ftp://imslp.org/logo_infested/CDSM/Cell ... 0Piano.pdf

Again, the presence of PPM files indicates that things were omitted.
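
Since a stray PPM is the giveaway, a check after the pdfimages step can flag the affected documents. A minimal sketch, assuming the images were extracted with a prefix such as bse as in the second post; find_skipped_pages is a made-up name.

```shell
# find_skipped_pages PREFIX
# Prints every PREFIX-*.ppm file in the current directory, i.e. the pages
# that were not written as PBM and so never went through cdsmrm.
# Returns 0 if at least one such file exists, 1 otherwise.
find_skipped_pages() {
    found=1
    for f in "$1"-*.ppm; do
        # An unmatched glob stays as literal text, so check the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

In a batch run, something like `if find_skipped_pages bse; then echo "bse: pages missing from cleaned PDF" >&2; fi` would leave a list of documents to fix by hand.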

Sorry for all the questions, but I still don't know what to do with these files now.
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

buuuuuuuuuump...
imslp
Site Admin
Posts: 1642
Joined: Thu Jan 01, 1970 12:00 am

Re: Program to strip CDSM logos

Post by imslp »

If there isn't a logo_cleaned or similar directory, you can create one. :-)
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

Any guidelines or best practices for keeping bookmarks and/or metadata? Or do we not care?
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Re: Program to strip CDSM logos

Post by Carolus »

We don't like to keep metadata. The bookmarks are OK as long as there's nothing proprietary included. BTW, I downloaded your cleaned files and they came up as "damaged and unreadable". I use Acrobat Pro version 8 (Mac OS).
Mazin
regular poster
Posts: 44
Joined: Sun Jan 25, 2009 3:03 am

Re: Program to strip CDSM logos

Post by Mazin »

Carolus wrote: We don't like to keep metadata. The bookmarks are OK as long as there's nothing proprietary included. BTW, I downloaded your cleaned files and they came up as "damaged and unreadable". I use Acrobat Pro version 8 (Mac OS).
Try again. I had an issue where my WiFi adapter would overheat from transmitting data and disconnect, so quite a few files were only partially uploaded. I then started the upload again from a hardwired computer set to replace any file for which the server-side and local file differed in size, which should have replaced all the damaged files. Let me know if any are still damaged!
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Re: Program to strip CDSM logos

Post by Carolus »

Just downloaded the Bach titles from the cleaned folder. Wonderful - fantastic job!! Now you'll have to try your hand at the volumes of the Digital Bach Edition which are there. There's the DBE logo to contend with, possibly some CDSM logos too, as well as the usual metatags, etc. If you like a real challenge, try taking on some Google scores. We have a number of those on the wiki already, with the Google logos removed. The problem is with those annoying random sections where a page or even a portion of a page is done at 150 dpi grayscale while the rest of the document is 600 dpi monochrome - much better for printing. The mixture of grayscale and monochrome on the same page causes some printers to crash when attempting to print these.