[lug] Those Pesky Newlines

Tkil tkil at scrye.com
Thu Dec 26 23:13:37 MST 2002


>>>>> "Peter" == Peter Hutnick <peter-lists at hutnick.com> writes:

Peter> Is there an obvious way to strip "line wrap" newlines while
Peter> leaving "paragraph separating" newlines intact?  IOW, I need to
Peter> strip all the newlines out of a file *except* ones on their own
Peter> line and ones that are on a line with text that is followed by
Peter> a line that consists only of a newline.

Peter> I'm partial to sed for this purpose, but if someone can do it
Peter> in a Perl* one-liner I'd be just as grateful.

Peter> The task at hand, in case anyone cares, is converting a Free
Peter> book ("Think Like a Computer Scientist") to Weasel Reader's
Peter> zTXT format.  Paragraphs need to be intact, but it needs to do
Peter> its own wrapping.

Peter --

I use an emacs macro for this; it converts variously formatted "plain
text" (including some of the Microsoft charset evils) into the same
format you're looking for above.  My reason is similar: I use
"txt2pdbdoc" to create documents for AportisDoc on my Palm system.

Anyway, here it is:

(defun tony-cleanup ()
  (interactive)
  (save-excursion
    (if (y-or-n-p "Remove '^M's? ")
        (progn
          (goto-char (point-min))
          (replace-string "\r" "")))
    (if (y-or-n-p "Remove trailing whitespace? ")
        (progn
          (goto-char (point-min))
          (replace-regexp "[ \t]+$" "")))
    (if (y-or-n-p "Fix HTML entities?")
        (progn
          (goto-char (point-min))
          (replace-regexp "&lt;" "<")
          (goto-char (point-min))
          (replace-regexp "&gt;" ">")
          (goto-char (point-min))
          (replace-regexp "&amp;" "&")
          (goto-char (point-min))
          (replace-regexp "&quot;" "\"")))
    (if (y-or-n-p "Fix high-bit quotes?")
        (progn
          (goto-char (point-min))
          (replace-regexp "‘" "\"")
          (goto-char (point-min))
          (replace-regexp "’)" "'")
          (goto-char (point-min))
          ;; (replace-regexp "s’" "s'")
          ;; (goto-char (point-min))
          (replace-regexp "’" "'")

          (goto-char (point-min))
          ;; (replace-regexp "“\\|”" "\"")
          (goto-char (point-min))
          (replace-regexp "“" "``")
          (goto-char (point-min))
          (replace-regexp "”" "''")
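          ;; The search characters in the next three replacements were
          ;; lost in the list archive (they show up as "?"), so they are
          ;; commented out; as written they would match literal
          ;; question marks.
          ;; (goto-char (point-min))
          ;; (replace-regexp "?" "'")
          ;; (goto-char (point-min))
          ;; (replace-regexp "?" "\"")
          ;; (goto-char (point-min))
          ;; (replace-regexp "?" "\"")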
          (goto-char (point-min))
          (replace-regexp "‹" "--")
          (goto-char (point-min))
          (replace-regexp "…" "...")
          ))
    (if (y-or-n-p "Remove leading whitespace? ")
        (progn
          (goto-char (point-min))
          (replace-regexp "^[ \t]+" "")))
    (if (y-or-n-p "Try to find broken lines? ")
        (let ((case-fold-search nil))
          (goto-char (point-min))
          (replace-regexp "\n\n+\\([a-z]\\)" "\n\\1")))
    (if (y-or-n-p "Attempt to find beginning of paragraphs? ")
        (progn
          (goto-char (point-min))
          (replace-regexp "\n+[ \t]+" "\n\n\t")))
    (if (y-or-n-p "Remove leading '>'?")
        (progn
          (goto-char (point-min))
          (replace-regexp "^>" "")))
    (if (y-or-n-p "Attempt to add inter-paragraph spacing?")
        (progn
          (goto-char (point-min))
          (replace-regexp "\\s.$" "\\&\n")
          (goto-char (point-min))
          (replace-regexp ",\n\n" ",\n")))
    (if (y-or-n-p "Delete excess blank lines? ")
        (progn
          (goto-char (point-min))
          (replace-regexp "\n\n\n+" "\n\n")))
    (if (y-or-n-p "Try to fix broken quoting?")
        (progn
          (goto-char (point-min))
          (let ((case-fold-search nil))
            (query-replace-regexp "\\<R\\([^S]+\\)S\\>" "\"\\1\"")
            (query-replace-regexp "U\\([a-z]+\\)\\>" "'\\1"))))
    (if (y-or-n-p "Fill region? ")
        (progn
          (fill-region (point-min) (point-max))))))

The translation of the above to Perl should be straightforward;
remember that elisp regexes are written as ordinary double-quoted
strings, thus all the doubled backslashes, and that elisp groups with
\( \) where Perl uses bare parentheses.
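
As a rough illustration of that translation, here is a minimal sketch
in Python rather than Perl (the regex syntax is close enough to show
the idea).  It ports only a few of the steps above, then joins
line-wrap newlines while leaving blank-line paragraph separators
alone.  The function names cleanup and unwrap are made up for this
example, and treating the final newline of the file as something to
keep is an assumption, not something the elisp does.

```python
import re


def cleanup(text):
    """Port of a few of the elisp steps, in the same order."""
    # "Remove '^M's": drop carriage returns.
    text = text.replace("\r", "")
    # "Remove trailing whitespace": elisp "[ \t]+$"; MULTILINE makes
    # $ match at the end of every line, as it does in elisp.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # "Try to find broken lines": elisp "\n\n+\\([a-z]\\)" -> "\n\\1".
    # In a raw string the doubled backslashes go away and the group
    # is written with bare parentheses.
    text = re.sub(r"\n\n+([a-z])", r"\n\1", text)
    # "Delete excess blank lines": elisp "\n\n\n+" -> "\n\n".
    text = re.sub(r"\n\n\n+", "\n\n", text)
    return text


def unwrap(text):
    """Replace line-wrap newlines with spaces, keep paragraph breaks.

    A newline counts as a line wrap when it is neither preceded nor
    followed by another newline.  The newline at the very end of the
    text is kept (an assumption made for this sketch).
    """
    return re.sub(r"(?<!\n)\n(?!\n|\Z)", " ", text)


if __name__ == "__main__":
    sample = (
        "First line of a  \r\n"
        "wrapped paragraph.\r\n"
        "\r\n\r\n\r\n"
        "Second paragraph,\n"
        "also wrapped.\n"
    )
    print(unwrap(cleanup(sample)), end="")
```

Running cleanup before unwrap matters: lines holding only spaces or
tabs have to become truly empty first, otherwise they will not be
recognised as paragraph separators.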

Not that it has anything *at all* to do with why I wrote the above,
but the people who run (used to run?) alt.sex.stories.moderated have
some sort of script that tries to clean up text to a predictable
format.  If you can find that, it should be easy to take its output
and turn it into what you're looking for.

Posting from the DNA Lounge (www.dnalounge.com),
t.

