Removing text from PDFs - Copying other archives

bobnotts
Posts: 13
Joined: Thu Jul 12, 2007 1:17 pm
Location: Sheffield, UK

Post by bobnotts »

I've just discovered Art Song Central and I want to copy all the PDFs over to IMSLP. Can someone explain how to remove the text at the bottom of the PDF pages? I have Acrobat 6.0. Alternatively, is someone willing to remove this text themselves? I couldn't find a help file on this which is why I'm asking here. I'd be happy to write a help page if/when I work this process out!

Cheers, Rob
horndude77
active poster
Posts: 293
Joined: Sun Apr 23, 2006 5:08 am
Location: Phoenix, AZ

Post by horndude77 »

This is usually a manual process: extract the images from the PDF, then remove the offending block with an image editor. If the text is consistently in the same place, the whole thing can probably be automated. I'm sure ImageMagick can blank out sections of an image. (If it's not in a consistent place, we need some computer vision expert to help us :) ) In the past I've used the Unix command pdfimages to extract images, but with that you lose the DPI information. So if you get something written up on the easiest way to do this, it would be very useful.
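
For the consistent-position case, the blanking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tested workflow: it assumes each page has already been extracted as a binary PBM file (magic number P4, which pdfimages writes for monochrome images as far as I know), that the header contains no comment lines, and that the unwanted text sits in a strip of known height at the bottom of the page. The function name and the strip height are made up for the example. PBM carries no resolution metadata, which matches the DPI caveat above.

```python
def blank_bottom_rows_pbm(data: bytes, rows: int) -> bytes:
    """Return a copy of a binary PBM (P4) image with its bottom `rows` rows set to white.

    Hypothetical helper for illustration only.
    Assumption: the header is "P4 <width> <height>" with no '#' comment lines.
    """
    # Read the three whitespace-separated header tokens: magic, width, height.
    tokens = []
    pos = 0
    while len(tokens) < 3:
        while data[pos:pos + 1].isspace():
            pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates the header from the raster

    if tokens[0] != b"P4":
        raise ValueError("not a binary PBM (P4) file")
    width, height = int(tokens[1]), int(tokens[2])

    # Each row is packed 8 pixels per byte and padded to a whole byte.
    row_bytes = (width + 7) // 8
    raster = bytearray(data[pos:])
    if len(raster) < row_bytes * height:
        raise ValueError("raster data is shorter than the header claims")

    # In PBM a 0 bit is white, so zeroing the bytes blanks the rows.
    rows = max(0, min(rows, height))
    first = (height - rows) * row_bytes
    raster[first:first + rows * row_bytes] = bytes(rows * row_bytes)

    return data[:pos] + bytes(raster)
```

To use it, read a page file in binary mode, pass its bytes together with the strip height in pixels (for example 150, a guess that would have to be measured on a real page), and write the result back out. The pages would still need to be reassembled into a PDF afterwards with a separate tool.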
Lyle Neff
active poster
Posts: 702
Joined: Wed Mar 14, 2007 3:21 pm
Location: Delaware, USA

Post by Lyle Neff »

I've used Acrobat 6.0 to extract images and then load each image into an image editor to remove extraneous material (mostly blotches, scratch-lines, or owner/library markings, and also the results of cropping).

Trouble is, after I do that, and then recompile the images into a new PDF, the file-size increases monstrously compared to the initial PDF scanning.
DANewman
Posts: 4
Joined: Fri Nov 02, 2007 8:02 pm

Re: Removing text from PDFs

Post by DANewman »

bobnotts wrote: I've just discovered Art Song Central and I want to copy all the PDFs over to IMSLP. Can someone explain how to remove the text at the bottom of the PDF
I would find that rather rude, considering the amount of work I've put into developing the site. Besides which, not all the songs on Art Song Central are public domain in Canada.

If you believe in fostering the dissemination of public domain sheet music, work in conjunction with the other people who are already doing it instead of trying to leech off their efforts.

David Newman
DANewman
Posts: 4
Joined: Fri Nov 02, 2007 8:02 pm

A more constructive effort

Post by DANewman »

If you really want to do something constructive, look at the source index I've created at Art Song Central. Quite a few of the sources I list are already available online, but are less than ideal for the end user.

In particular, the books that have been scanned at Google Books or the Internet Archive are often very large downloads for a user that may be looking for a single piece. Google's PDF files are also very hard for older computers to process, as are IA's DJVU files, and IA's B&W PDF files are often incorrectly thresholded.

Work on splitting up those files into individual songs, and you'll be helping IMSLP and Art Song Central at the same time. And you'll be helping all our patrons access this music more easily, instead of just transferring files from one free location to another.

I've done some of this myself, and can help you with the process if you're interested.
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Post by Carolus »

Google does some very odd things in their scanning that I've noticed. One, which is a real pain, is that they have random pages - or even a single random system on pages with more than one - scanned as a grayscale JPEG at 150dpi, while the rest of the score is 600dpi monochrome.

Some of the IA items prepared by Microsoft have the corporate logo pasted on every single page - often right in the music. Others have attempted to put in Chord Symbols in hidden text above the top line in vocal / piano scores. They will not be viewable at all for anyone with an older version of Adobe Reader, and the logo stuck in the middle of music is a real pain if you print out and attempt to use some pages. I wonder if it's sort of a back-door attempt to claim proprietary interest in a public domain item (a scan of a public domain music score).
aldona
active poster
Posts: 385
Joined: Mon Apr 16, 2007 11:09 pm
Location: Melbourne, Australia

Post by aldona »

If you believe in fostering the dissemination of public domain sheet music, work in conjunction with the other people who are already doing it instead of trying to leech off their efforts.
With all due respect, I get the impression that Rob had no malevolent intentions; he may well have come straight from the "Guide to Contributing Scores" where he would have read the following:
Public domain scans of composers' works can be found on several websites. One of the relatively major goals of IMSLP is to be a centralized site for music scores, which means it is a good idea to submit scores from other public domain music score websites, if they are not on IMSLP already. Centralization improves the usefulness of the scans, since people can obtain them more easily. Please refer to our ever-expanding list of public domain music score websites.
I know I have used IMSLP as a kind of "Noah's Ark" for the gathering together and preservation of rare and endangered scores from all over the internet, and I have often "harvested" interesting scores from other PD websites (such as Sibley Music Library and the Danish Royal Library). (while acknowledging the original source as "scanner" - your point about linking addresses the same purpose.)

I also know how easy it is to get carried away with excitement when I stumble upon a gold mine of PD scores like the above mentioned sites.

Aldona
“all great composers wrote music that could be described as ‘heavenly’; but others have to take you there. In Schubert’s music you hear the very first notes, and you know that you’re there already.” - Steven Isserlis
DANewman
Posts: 4
Joined: Fri Nov 02, 2007 8:02 pm

Post by DANewman »

aldona wrote: I know I have used IMSLP as a kind of "Noah's Ark" for the gathering together and preservation of rare and endangered scores from all over the internet, and I have often "harvested" interesting scores from other PD websites (such as Sibley Music Library and the Danish Royal Library). (while acknowledging the original source as "scanner" - your point about linking addresses the same purpose.)
I understand the desire to have a centralized resource where things can be protected from falling off the net. Please know that I am not going anywhere, and that if anything were to happen to my archive for any reason, I'd see its contents promptly transferred to another archive. I've been in the public domain preservation game for more than a decade, having put considerable work into Project Gutenberg before I shifted my focus to Art Song Central. I'm committed to the cause.

While the thought of having my entire site copied wholesale turns my stomach, the most offensive thing to me was the suggestion to remove the text at the bottom which tells people where they can find more of the same. If there are specific pieces one wishes to harvest from my site, I hope the link is offered and the text at the bottom left intact.

But as I pointed out above, there's so much real work to be done! There are so many scores unscanned, or files to be repurposed. I've also got dozens of books half-finished. I'd love help getting them to the usable stage. I encourage anyone wishing to expand the offerings in this genre to contact me!

I've read elsewhere in these forums that contributors are discouraged from copying files from CPDL. I would hope that attitude would also extend to other sites which have made a determined effort to assemble a significant body of work, and which have dedicated user bases.

Regardless, I will be linking like crazy to IMSLP, helping organize and annotate the material in a way that benefits my own users, who tend to be voice teachers and their students, and simultaneously driving traffic to IMSLP. I hope that contributors to IMSLP will respect the work I've been doing and continue to do, and work in a way that benefits us all.

Thanks,

David
bobnotts
Posts: 13
Joined: Thu Jul 12, 2007 1:17 pm
Location: Sheffield, UK

Post by bobnotts »

Sorry for taking so long to reply to this. To be honest, I forgot about this post for some time! Anyhow, I hope to explain my reasoning here.

My rationale for removing the Art Song Central notice from the PDFs is that they would be hosted on IMSLP so it would seem rather strange to a person downloading a score from IMSLP to have a note about it being from ASC. Now - ideally, this wouldn't be necessary because I would link to ASC, not re-host the files on IMSLP (as has been happening on CPDL). However, I knew that IMSLP has a policy to only have scores indexed that reside on the IMSLP servers, a policy that I find slightly restrictive as a contributor, though I completely understand the rationale for it as an admin (on CPDL I have spent a lot of time fixing broken links and, where sites are down, wading through the Internet Archive to find editions linked to from CPDL to upload them onto the CPDL server). If IMSLP allowed external links, I would have approached you first, David, to establish a linking system but as it doesn't and the scans are wholly public domain, I saw no reason not to rehost as I saw no other possibility for making the works directly available to IMSLP users.

As Aldona has helpfully pointed out, this is exactly the sort of action which is requested of IMSLP contributors on the help pages (and I note that a number of scans have been copied directly from the Sheet Music Archive and no doubt other sources too. Someone had to scan those scores to start with, too). I admit that I was a little hasty just to post on here without really thinking the point through but it's also worth pointing out that the sheet music scans on ASC are 100% public domain (in the US, at least) so legally, anyone can do anything they like with them. Having said that, I have tried to put myself in your shoes, David, and I appreciate that you've put a lot of work into this site, so I'd probably be a bit annoyed too if I were you.
I've read elsewhere in these forums that contributors are discouraged from copying files from CPDL. I would hope that attitude would also extend to other sites which have made a determined effort to assemble a significant body of work, and which have dedicated user bases.
I think it's important to make the distinction here, between retypeset (and often edited to some degree or another) scores available on CPDL and the scanned scores available on ASC. Editorial additions to works can be copyrighted and the licenses used on IMSLP don't immediately seem to be compatible with the CPDL license. Copying files from CPDL to IMSLP which have substantial editorial content and which have been licensed under the CPDL license could potentially cause problems. However, copying public domain scans is always legal (assuming that the work is public domain in the other country, of course). It seems to me that this is the primary reason for not rehosting CPDL scores.
If you believe in fostering the dissemination of public domain sheet music, work in conjunction with the other people who are already doing it instead of trying to leech off their efforts.
To be quite frank, I don't see how I would be leeching off your efforts if I were to copy the ASC PDFs to IMSLP - it's of no benefit to me, only the thousands of people who use IMSLP every day! To turn the situation around, by this rationale, surely you are leeching off composers' efforts by making scans of their works available on your website?!! (I don't mean this, of course, but merely trying to illustrate a point...)
Regardless, I will be linking like crazy to IMSLP, helping organize and annotate the material in a way that benefits my own users, who tend to be voice teachers and their students, and simultaneously driving traffic to IMSLP. I hope that contributors to IMSLP will respect the work I've been doing and continue to do, and work in a way that benefits us all.
I think that's an admirable sentiment, and one which I also hold. As a singer, I recognise the value of a free resource solely devoted to singers and I will be in touch with David shortly to discuss how I might help contribute to ASC (in a very limited way) in the future, because I admire the project he's started so much.
DANewman
Posts: 4
Joined: Fri Nov 02, 2007 8:02 pm

Post by DANewman »

I'm tempted to let this drop, but there are some pretty important issues at stake here. Understand that I respect the work you've done at CPDL, and I recognize that there were glimmers of an apology somewhere in what you just wrote. I apologize for the abruptness of my first response, though it seems you understand where I was coming from. However...
bobnotts wrote: I saw no other possibility for making the works directly available to IMSLP users.
First of all, IMSLP users have access to all the files I've put up... through Art Song Central. Right now, Art Song Central has gathered the most significant collection of free, printable, "classical" music for piano and voice on the web. There are numerous ways that IMSLP could highlight that fact and send its users to the source. Instead, no matter how you couch it, moving all of ASC's files to IMSLP is a direct attack on ASC's reason for existence. It only makes sense if your goal is simply to enhance the stature of IMSLP at the expense of ASC, not to enhance the availability of public domain music on the web.

There are lots of websites where someone has collected a few pieces and put them up. These sites may disappear and be a problem... I've already discovered a few links on ASC that have broken since I added them within the last year. I'm learning my lesson on that. However, I don't worry about broken links from CPDL, Mutopia or WIMA, because I know they'll be around. ASC is like this, too. It's not going anywhere. The links aren't going to break.
this is exactly the sort of action which is requested of IMSLP contributors on the help pages (and I note that a number of scans have been copied directly from the Sheet Music Archive and no doubt other sources too. Someone had to scan those scores to start with, too).
That contributors are requested to do that does not make it right, or beneficial. On one hand, there is legitimate reason to copy scans from a source like SMA, since they heavily restrict access to their content, or from personal sites that offer a few things but may disappear at any time. But ASC won't restrict its content and won't disappear. (Legal threats aren't a problem as they were for IMSLP, because everything is posted in accordance with Project Gutenberg's well defended policies.) So copying files from the site generally gains nothing except, again, to enhance the stature of IMSLP at the expense of ASC. I think it's pretty fair to call that leeching. I would like to see IMSLP amend this policy.

(And, removing the unobtrusive note from the bottom of each PDF file only gives additional credence to the impression that this is about site stature and not adding value.)

I've invested at least 600 hours building ASC into what you see today, and many more in the previous decade as I gathered scores and searched for a way to implement it. Much of the work is half done. I have many books that have been scanned, but still need to be cleaned and organized, converted to PDF, researched and posted. Eventually, I want the most popular songs transcribed, and have begun that work. Want to help people have access to these resources? Then come help me do the work of providing them instead of diminishing the value of the site I built for them.

My fear is that there is already a culture created at IMSLP that values the site above the mission. I've seen this at Wikipedia as bands of IMSLP volunteers go through articles removing entries for other websites and replacing them with IMSLP. (And planning and discussing it in the forum...)

Ironically, much of that work seemed to be fired not just by a zeal for dominance, but by an anti-commercial agenda which now seems rather ridiculous in light of the bold advertising which is omnipresent on this site. (Will IMSLP then follow where others have gone before and offer ad-free browsing for a fee?) That said, I do understand the need to raise money, and I'm hoping that ASC will one day bring in enough to pay for itself...

I beg you, knowing that your heart is in the right place, to value the mission above the site. Keep up the good work you do at CPDL, and help foster an environment at IMSLP that encourages cooperation and not competition.
As a singer, I recognise the value of a free resource solely devoted to singers and I will be in touch with David shortly to discuss how I might help contribute to ASC (in a very limited way) in the future, because I admire the project he's started so much.
I appreciate that, and look forward to talking to you.
Yagan Kiely
Site Admin
Posts: 1139
Joined: Sun Jan 14, 2007 8:16 am
Location: Perth, Australia

Post by Yagan Kiely »

I'm not really getting into any argument/discussion/whatever but...
not to enhance the availability of public domain music on the web.
Having scores spattered all over the internet does not enhance the availability of PD scores.
So copying files from the site generally gains nothing except, again, to enhance the stature of IMSLP at the expense of ASC. I think it's pretty fair to call that leeching. I would like to see IMSLP amend this policy.
As above, having all PD files in one spot rather than linking to every website (and having to learn layouts and site maps of every site) is not to enhance the stature of IMSLP, it is to enhance availability of PD. IMSLP doesn't need enhancing at any rate, it is already the primary source of PD scores and with the largest collection.
(And, removing the unobtrusive note from the bottom of each PDF file only gives additional credence to the impression that this is about site stature and not adding value.)
You have to remember, you will be credited with it regardless of logos.
(And planning and discussing it in the forum...)
I've been here a while and never seen that, please back this up with an example.
Ironically, much of that work seemed to be fired not just by a zeal for dominance, but by an anti-commercial agenda which now seems rather ridiculous in light of the bold advertising which is omnipresent on this site.
You are very negative about IMSLP without any reason to be. There is no zeal for dominance, and we are still anti-commercial. I don't know what you think, but 'bold' adverts do not make this place commercial... at all.
(Will IMSLP then follow where others have gone before and offer ad-free browsing for a fee?)
There is absolutely no reason to do this and it will never happen.
not competition.
This is not true, IMSLP does not compete with other places, all it does is strive to collect all PD music into one spot - legally.
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Post by Carolus »

A couple of observations:

1. We really don't need to be duplicating the content of ASC. For one thing, there are a number of things there which are simply not legal in Canada - Poulenc for starters. Another is that ASC is (understandably) focused on vocal repertoire, while IMSLP is more general in nature. ASC also seems to be specifically designed with singers in mind, while IMSLP's interface is not so specific. While it's only natural to have some duplication between the two, there's not a real need for a "mirroring" of ASC. I imagine that anyone interested in vocal music would be visiting both IMSLP and ASC.

2. The Amazon links at IMSLP are not ads, any more than the "omnipresent" Amazon, Sheet Music Plus, and MusicNotes links at ASC are. It's a little odd for one who lives in a glass cathedral to hurl such stones, don't you think? Servers and bandwidth are not free. Neither are lawyers. Although he made many bogus assertions and unfounded accusations last fall, Mr. Irons (the PR guy for everyone's absolute favorite Austrian music publisher) did make one good point: There is no such thing as a free lunch. (Even a broken clock gives the correct time twice a day.)

3. If there should be a need to fill out a particular song-cycle at IMSLP, there's nothing wrong with including a copy of the file from ASC. Just be sure to list ASC as the scanner, preferably using the following format [http://artsongcentral.com/ Art Song Central] so there's a direct link back to ASC. If DANewman is agreeable to this type of limited duplication, there is no real need to remove the ASC info from the file itself. We have a number of Mutopia files with the full Mutopia URL on the last page, yet we do not mirror the Mutopia site. DANewman is free to do likewise with IMSLP, of course, since there are a number of art songs by Russian composers here lacking at ASC which he might find to be of interest.
sembian
Posts: 1
Joined: Mon Jul 21, 2008 7:59 pm

Post by sembian »

Hi everyone,

My 2 cents about this debate:

Attribution is very important for a project such as IMSLP because of the contributory nature. This is also one of the reasons why there is no Creative Commons license without attribution. IMSLP does respect this in that the scanner is identified for scanned scores.

However, attribution is not the same as control. Giving control of a public domain scan to the scanner might very well invite unforeseen disasters. While it is certainly an understandable emotion on the part of the scanner, it also invites the worst form of copyright: perpetual and manual labor copyright. This is because, if manual labor like scanning warrants copyright protection and control over the product by the creator (which it currently does not under US law), then everything, regardless of how non-creative, would be controlled by its creator (or in other words, copyrighted). I believe that would be infinitely worse than simply hosting scans from other websites while giving attribution (if that is a bad thing at all, as I will show below). It seems to me that IMSLP is positioned especially to prevent such disasters... but unfortunately there is no major action that will satisfy every person on the planet.

I think the wide dissemination and distribution of public domain material is very important, and does not conflict with the centralization mentioned by previous posters. From the open letter I know that Feldmahler is working on a backup system, and this would be the type of dissemination that is very beneficial for public domain material. Centralization also seems to me to be important because, as was said before, it drastically improves the usefulness of the scores, since finding them becomes painless, and people know where to submit scores. Since IMSLP is based on the wiki principle, is also one of the largest music score libraries on the internet, and in addition offers backups (when it is completed, which I hope is soon), it would make sense for IMSLP to be a point of centralization.

But I do have to reiterate that regardless of centralization or distribution, attribution is very important, and IMSLP should always give attribution to the fullest extent possible.
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Post by Carolus »

Thank you, sembian, for raising this important aspect of IMSLP's mission: Keeping the Public Domain - Public Domain. That is one of the reasons we should have no problem whatsoever with Art Song Central and other sites of such nature. They are not competitive with IMSLP's mission, but complementary to it.

Preserving the public domain can be achieved by the employment of two seemingly contradictory methods: centralization (IMSLP) and dispersal (other specialized sites like ASC, Mutopia, and other archives). Having copies of the files located in many places helps ensure against their loss in the event of disasters ranging from legal actions from insane music publishers to floods, earthquake and fire. At the same time, a central archive like IMSLP makes access a snap for anyone looking for this material. The nice thing about having selected ASC files at IMSLP is that the person looking for something can then bounce over to ASC for even more specialized info.
Johann Casper Ferdinand F
regular poster
Posts: 32
Joined: Thu Jul 24, 2008 12:00 am

Post by Johann Casper Ferdinand F »

One of the things we should all remember is that we all fundamentally support free content and support the right of the public to use public domain material without restrictions. Trying to assert control or limit reuse of material that is fairly and legally in the public domain is counter to the core values of the open content movement. The "I scanned it, therefore I control the use of the scan" argument is one that many libraries, archives, and museums use both for revenue generation and control.

IMSLP, Wikipedia, Mutopia, CPDL, and similar projects rely upon the fact that public domain material can be copied without restrictions. We all depend on the fact that possession of an original work (e.g. a manuscript) does not confer control over derivative uses. We depend upon the fact that scanning, reprinting, and republication do not confer new rights.

Source attribution beyond that appropriate to the medium can and should be removed. At IMSLP, we provide attribution for the source edition, the scanner, and the uploader. It is neither necessary nor appropriate to provide such attribution on each page of a score, and it is distracting or problematic in many reuse situations.

If you love something, set it free.