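Adding page numbers to a PDF with Python

I have been writing release notes lately and kept finding TeX inconvenient: a TeX Live install takes up a lot of disk space and brings assorted dependency problems. I wondered whether Markdown would be easier. In practice, Typora's PDF export works very well; its one shortcoming is that the generated PDF has no page numbers.

This can be fixed with an online tool, or with the PDF editing features of Adobe or Foxit products, but in many situations, especially on a work machine, those are not convenient to use. So I turned to Python and added the page numbers with a script.

Installing the Python libraries

First install two dependencies, PyPDF2 and ReportLab. PyPDF2 can split, merge, delete pages from, and encrypt PDFs; ReportLab is more powerful still, as its official description below puts it.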

We build solutions to generate rich, attractive and fully bespoke PDFs at incredible speeds.
Over 5 million documents are generated each month using Reportlab's software
--- https://www.reportlab.com/

# Pin PyPDF2 below 3.0: the PdfFileReader/PdfFileWriter API used here was removed in 3.0.
sudo pip3 install "pypdf2<3"
sudo pip3 install reportlab

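Implementation

The approach to adding page numbers is as follows:

  1. Open the PDF that needs page numbers with PyPDF2 and record the total page count
  2. Use ReportLab to create a temporary PDF that contains only page numbers, with the same number of pages as the file being modified
  3. Merge the temporary PDF with the PDF being modified
  4. Save the merged PDF file

The details are as follows:

Creating the temporary PDF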

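ReportLab builds a PDF by drawing on a canvas, one page at a time. Creating a page is like painting on that canvas: the page number is drawn at a chosen position. An A4 sheet is 210 mm × 297 mm, so with the origin at the bottom-left corner of the canvas, the page number goes at roughly (210/2 - 1, 4) = (104, 4), in mm.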
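To make the arithmetic concrete, here is a small stdlib-only sketch of the unit conversion involved. PDF coordinates are measured in points (1/72 inch) and one inch is 25.4 mm, which is the conversion that `reportlab.lib.units.mm` encapsulates. The helper names `mm_to_pt` and `page_number_position` are made up for this illustration and are not part of ReportLab.

```python
# Stdlib-only sketch of the coordinate arithmetic; helper names are hypothetical.
A4_WIDTH_MM = 210
A4_HEIGHT_MM = 297
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_pt(value_mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return value_mm * POINTS_PER_INCH / MM_PER_INCH


def page_number_position(width_mm=A4_WIDTH_MM, offset_mm=1, bottom_mm=4):
    """Return the (x, y) start of the page number in mm, origin at bottom-left."""
    return (width_mm / 2 - offset_mm, bottom_mm)


x_mm, y_mm = page_number_position()
print('position in mm:', x_mm, y_mm)
print('position in pt:', round(mm_to_pt(x_mm), 2), round(mm_to_pt(y_mm), 2))
```

Because `drawString` places the left edge of the text at the given x, shifting 1 mm left of the page centre only approximately centres the number; for a different page size, pass its width as `width_mm`.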

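The function below, create_pdf_with_pagenumber, takes the total page count, draws the page number on each page in turn, and saves the result to the tmp file.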

from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont


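def create_pdf_with_pagenumber(tmp, num):
    '''Create a temporary PDF whose pages contain only a page number.'''
    # Windows-only font path; point this at a TTF that exists on your system.
    pdfmetrics.registerFont(
        TTFont('Times-New-Roman', 'C:\\Windows\\Fonts\\times.ttf'))
    c = canvas.Canvas(tmp)  # the page size defaults to A4
    for i in range(num):
        # showPage() resets the graphics state, so set the font on every page.
        c.setFont('Times-New-Roman', 10)
        # Start the number 104 mm from the left edge and 4 mm from the bottom.
        c.drawString((104)*mm, (4)*mm, str(i + 1))
        c.showPage()  # finish this page and start the next one
    c.save()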
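The font path in the function above exists only on Windows. As a rough sketch of one way to make that step portable, the helper below returns the first font file that exists from a list of candidates. The name `find_font` is hypothetical, and only the first candidate path comes from the article; the other two are assumptions about typical macOS and Linux installs and may not match your machine.

```python
import os

# Candidate locations for a Times-like TrueType font. Only the first path is
# from the article; the others are assumptions about typical installs.
FONT_CANDIDATES = [
    'C:\\Windows\\Fonts\\times.ttf',
    '/Library/Fonts/Times New Roman.ttf',
    '/usr/share/fonts/truetype/liberation/LiberationSerif-Regular.ttf',
]


def find_font(candidates=FONT_CANDIDATES):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = find_font()
print(font_path or 'no TrueType font found, use a built-in font instead')
```

The result could be passed to `TTFont(...)` in place of the hard-coded path. If it is `None`, ReportLab's built-in `Times-Roman` font can be selected with `setFont` without registering anything.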

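Merging the PDFs

Merging relies mainly on PyPDF2's PdfFileWriter and PdfFileReader, one for writing and one for reading. The process is simple: open the files, read them page by page (getPage), use mergePage to overlay the corresponding pages of the two PDFs, then add each merged page (addPage) to the output file.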

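import os

from PyPDF2 import PdfFileWriter, PdfFileReader

# create_pdf_with_pagenumber() is the function defined in the previous section.
path = 'release_notes.pdf'
tmp = "tmp.pdf"

dst_pdf = PdfFileWriter()
with open(path, 'rb') as f:
    src_pdf = PdfFileReader(f, strict=False)
    n = src_pdf.getNumPages()
    create_pdf_with_pagenumber(tmp, n)

    with open(tmp, 'rb') as ftmp:
        num_pdf = PdfFileReader(ftmp)
        for i in range(n):
            print('page: %d of %d' % (i+1, n))
            page = src_pdf.getPage(i)
            num_layer = num_pdf.getPage(i)

            # Overlay the page-number layer on the original page.
            page.mergePage(num_layer)
            dst_pdf.addPage(page)

        # Write while both input files are still open, because PyPDF2 may
        # read objects from them lazily.
        if dst_pdf.getNumPages():
            # splitext keeps directories and dots inside the file name intact.
            output = '{}_new.pdf'.format(os.path.splitext(path)[0])
            with open(output, 'wb') as out:
                dst_pdf.write(out)

    os.remove(tmp)  # the temporary number-only PDF is no longer needed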
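One detail worth a closer look is how the output file name is derived from the input path. Splitting the path at the first dot truncates names such as `docs/v1.2_notes.pdf` and yields an empty stem for `./notes.pdf`. The sketch below compares that with `os.path.splitext`; `output_name` is a hypothetical helper written for this comparison, not part of the script.

```python
import os


def output_name(path, suffix='_new'):
    """Insert suffix before the extension, keeping directories and inner dots."""
    root, ext = os.path.splitext(path)
    return '{}{}{}'.format(root, suffix, ext)


for p in ['release_notes.pdf', 'docs/v1.2_notes.pdf', './notes.pdf']:
    first_dot = p.split('.')[0] + '_new.pdf'
    print(p, '->', output_name(p), '| first-dot split:', first_dot)
```

For a plain name like `release_notes.pdf` both approaches agree; they differ only when the path contains additional dots.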

Complete code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''add page number to pdf file'''

import sys
import os

from reportlab.lib.units import mm
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

from PyPDF2 import PdfFileWriter, PdfFileReader


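def create_pdf_with_pagenumber(tmp, num):
    '''Create a temporary PDF whose pages contain only a page number.'''
    # Windows-only font path; point this at a TTF that exists on your system.
    pdfmetrics.registerFont(
        TTFont('Times-New-Roman', 'C:\\Windows\\Fonts\\times.ttf'))
    c = canvas.Canvas(tmp)  # the page size defaults to A4
    for i in range(num):
        # showPage() resets the graphics state, so set the font on every page.
        c.setFont('Times-New-Roman', 10)
        # Start the number 104 mm from the left edge and 4 mm from the bottom.
        c.drawString((104)*mm, (4)*mm, str(i + 1))
        c.showPage()  # finish this page and start the next one
    c.save()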


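def main():
    # Use the file given on the command line, else fall back to the default.
    # Keep the full path: stripping it to the base name would break files
    # that live outside the current directory.
    path = sys.argv[1] if len(sys.argv) > 1 else 'release_notes.pdf'
    if not os.path.isfile(path):
        sys.exit('file not found: {}'.format(path))

    tmp = "tmp.pdf"
    dst_pdf = PdfFileWriter()
    with open(path, 'rb') as f:
        src_pdf = PdfFileReader(f, strict=False)
        n = src_pdf.getNumPages()
        create_pdf_with_pagenumber(tmp, n)

        with open(tmp, 'rb') as ftmp:
            num_pdf = PdfFileReader(ftmp)
            for i in range(n):
                print('page: %d of %d' % (i+1, n))
                page = src_pdf.getPage(i)
                num_layer = num_pdf.getPage(i)

                # Overlay the page-number layer on the original page.
                page.mergePage(num_layer)
                dst_pdf.addPage(page)

            # Write while both input files are still open, because PyPDF2
            # may read objects from them lazily.
            if dst_pdf.getNumPages():
                # splitext keeps directories and inner dots intact.
                output = '{}_new.pdf'.format(os.path.splitext(path)[0])
                with open(output, 'wb') as out:
                    dst_pdf.write(out)

        os.remove(tmp)  # the temporary number-only PDF is no longer needed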


if __name__ == "__main__":
    main()

Usage is simple:

python3 main.py filename.pdf
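If the script grows more options, the command line could be parsed with the standard library's `argparse` instead of reading `sys.argv` directly. The following is only a sketch: `parse_args` is a hypothetical helper, and the `-o/--output` option is an assumption added for illustration that the script above does not implement.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; the default mirrors the script's fallback file."""
    parser = argparse.ArgumentParser(
        description='add page numbers to a PDF file')
    parser.add_argument('path', nargs='?', default='release_notes.pdf',
                        help='PDF file to number')
    # Hypothetical option, not present in the original script.
    parser.add_argument('-o', '--output', default=None,
                        help='where to write the numbered PDF')
    return parser.parse_args(argv)


args = parse_args(['filename.pdf'])
print(args.path, args.output)
```

Calling `parse_args()` with no argument reads the real command line, so `main()` could use `args.path` where it currently inspects `sys.argv`.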
