Turn PDFs into text files

Sometimes we want to study a PDF whose layout or formatting gets in the way. There are several ways around this: for instance, you can edit the file or extract its text.

There’s a program called pstotext that extracts the text from PostScript and PDF files, using Ghostscript, the interpreter for those formats.

Installation is very easy, as usual on this blog:

Install pstotext.


Using it is just as easy. In the terminal, type:

pstotext -output final.txt original.pdf


original.pdf is the file you want to extract the text from, and final.txt is the file that will be created. Note that if you leave out the -output parameter, no file is created and the text is printed in the terminal instead.
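If you have a whole folder of PDFs, the same command can be repeated in a loop. The following is only a sketch: convert_all is a name made up for this example, and it assumes every file ending in .pdf should get a .txt file with the same base name next to it.

```shell
# convert_all is a hypothetical helper name, not part of pstotext.
# Usage: convert_all [DIRECTORY]   (defaults to the current directory)
convert_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        # original.pdf -> original.txt, written next to the PDF.
        pstotext -output "${pdf%.pdf}.txt" "$pdf"
    done
}
```

For example, convert_all ~/papers would leave one .txt file beside each PDF in that folder (the folder name is just an illustration).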

There are other tricks you can do with this program, for example reading the text of the PDF in the terminal (with the help of the less program):

pstotext original.pdf | less


You can also remove a header from a PDF and save the result in a text file by using the grep program, which selects lines according to the pattern you give it:

pstotext original.pdf | grep -v "Copyright" > final.txt


Explanation: the -v parameter makes grep print every line except those containing the word "Copyright". The "> final.txt" part saves what would otherwise appear in the terminal to the file final.txt. Don’t forget that grep is case sensitive, so it distinguishes capital letters from small ones.
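To see what case sensitivity means in practice, here is a small sketch that does not need a PDF at all. The sample lines are invented and simply stand in for what pstotext might print; the -i parameter of grep makes the match ignore case.

```shell
# Invented sample text standing in for pstotext output.
sample=$(printf 'Copyright 2009 Example\nFirst line of text\ncopyright notice in lower case\nSecond line of text\n')

# Case sensitive: only the line with a capital "Copyright" is dropped.
filtered_cs=$(printf '%s\n' "$sample" | grep -v "Copyright")

# With -i the match ignores case, so both spellings are dropped.
filtered_ci=$(printf '%s\n' "$sample" | grep -v -i "copyright")

printf '%s\n' "$filtered_ci"
```

So if a header is written sometimes with capitals and sometimes without, adding -i to the grep command is probably what you want.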

To search for a word in a PDF file, just run:

pstotext original.pdf | grep "word"
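If you search often, the pipeline can be wrapped in a small shell function. This is just a sketch: pdf_search is a name invented here, and the -n parameter of grep is added so that each match is shown with its line number in the extracted text (not the page number in the PDF).

```shell
# pdf_search is a hypothetical helper name, not part of pstotext.
# Usage: pdf_search FILE.pdf WORD
pdf_search() {
    # -n prefixes each matching line with its line number
    # in the text that pstotext extracted.
    pstotext "$1" | grep -n -- "$2"
}
```

You would then call it as pdf_search original.pdf word.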


To save the text of a PDF file from the Internet to disk, write:

wget http://name.of.the.site/original.pdf -O- | pstotext -output final.txt
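The PDF format is designed for random access, so reading it from a pipe may not work on every system. In that case, one alternative is to download the file first and convert it afterwards. The sketch below assumes wget and mktemp are available; fetch_pdf_text is a name made up for this example.

```shell
# fetch_pdf_text is a hypothetical helper name, not part of pstotext.
# Usage: fetch_pdf_text URL OUTPUT.txt
fetch_pdf_text() {
    url=$1
    out=$2
    # Temporary file that holds the downloaded PDF.
    tmp=$(mktemp) || return 1
    # -q: quiet, -O FILE: write the download to FILE.
    if wget -q -O "$tmp" "$url"; then
        pstotext -output "$out" "$tmp"
        status=$?
    else
        status=1
    fi
    # Remove the temporary PDF whether or not the conversion worked.
    rm -f "$tmp"
    return "$status"
}
```

It would be called as fetch_pdf_text http://name.of.the.site/original.pdf final.txt.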


This whole process happens in the terminal. If you would rather avoid that, I suggest trying KWord, which can open PDF files and then save them in several formats such as .odt (an open counterpart to .doc), HTML, LaTeX, RTF, and many more.

Install kword.

Note that this program is built for KDE, so under GNOME it takes a bit longer to start.