
Extracting Images and Text from a PDF

Updated: at 01:12 AM

Extracting text from a PDF

Official documentation: https://mozilla.github.io/pdf.js/api/draft/module-pdfjsLib-PDFPageProxy.html

http://www.srcmini.com/61995.html

/**
 * Retrieves the text of a specific page within a PDF document obtained through pdf.js
 *
 * @param {number} pageNum The number of the page (pages are numbered from 1)
 * @param {PDFDocumentProxy} PDFDocumentInstance The PDF document obtained through pdf.js
 * @returns {Promise<string>} Resolves with the text of the page
 **/
function getPageText(pageNum, PDFDocumentInstance) {
  // Return a Promise that resolves once the text of the page has been retrieved.
  // Chaining the promises (instead of wrapping them in `new Promise`) lets
  // errors from getPage or getTextContent reach the caller.
  return PDFDocumentInstance.getPage(pageNum)
    .then(function (pdfPage) {
      // The main trick to obtain the text of the PDF page: use the getTextContent method
      return pdfPage.getTextContent();
    })
    .then(function (textContent) {
      var textItems = textContent.items;
      var finalString = '';

      // Concatenate the string of each item to the final string
      for (var i = 0; i < textItems.length; i++) {
        var item = textItems[i];

        finalString += item.str + ' ';
      }

      // Resolve with the text retrieved from the page
      return finalString;
    });
}
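
getPageText joins every item with a single space, so the line structure of the page is lost. The following is a minimal sketch of one way to keep line breaks, assuming each text item carries a `transform` array whose sixth entry is the vertical position of its baseline, as pdf.js text items do. The function name `textItemsToLines` and the `tolerance` parameter are hypothetical and not part of pdf.js.

```javascript
// Group pdf.js text items into lines by their baseline y-coordinate.
// `items` is expected to look like textContent.items from getTextContent():
// each item has a `str` string and a 6-element `transform` array whose
// last entry (index 5) is the vertical position of the text baseline.
// `tolerance` (hypothetical parameter) is how far the baseline may drift
// before the item is treated as belonging to a new line.
function textItemsToLines(items, tolerance = 2) {
  const lines = [];
  let current = [];
  let lastY = null;

  for (const item of items) {
    // Skip entries without text (for example marked-content markers).
    if (typeof item.str !== 'string') continue;

    const y = item.transform[5];
    // Start a new line when the baseline moves by more than `tolerance`.
    if (lastY !== null && Math.abs(y - lastY) > tolerance) {
      lines.push(current.join(' '));
      current = [];
    }
    current.push(item.str);
    lastY = y;
  }

  // Flush the last line, if any.
  if (current.length > 0) lines.push(current.join(' '));
  return lines.join('\n');
}
```

Inside getPageText, this could stand in for the concatenation loop by passing `textContent.items` to `textItemsToLines`. Multi-column layouts would need extra handling that this sketch does not attempt.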

Extracting images from a PDF

I found the following code in https://github.com/mozilla/pdf.js/issues/13541 and modified it slightly.

it('gets operatorList with JPEG image (issue 4888)', async function () {
  const loadingTask = getDocument(buildGetDocumentParams('cmykjpeg.pdf'));

  const pdfDoc = await loadingTask.promise;
  const pdfPage = await pdfDoc.getPage(1);
  const operatorList = await pdfPage.getOperatorList();

  const imgIndex = operatorList.fnArray.indexOf(OPS.paintImageXObject);
  const imgArgs = operatorList.argsArray[imgIndex];
  const { data } = pdfPage.objs.get(imgArgs[0]);

  expect(data instanceof Uint8ClampedArray).toEqual(true);
  expect(data.length).toEqual(90000);

  await loadingTask.destroy();
});
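
The test above uses `indexOf`, so it only reaches the first image on the page. As a rough sketch, every image name can be collected by scanning the whole operator list. The helper name `collectImageNames` and its `imageOps` parameter are hypothetical; the operator codes passed in are assumed to come from the `OPS` table of the pdf.js build in use.

```javascript
// Collect the object name of every image-painting operator on a page.
// `operatorList` mirrors the result of page.getOperatorList(): fnArray holds
// operator codes and argsArray holds the matching argument lists.
// `imageOps` is the list of operator codes to treat as images, for example
// [OPS.paintImageXObject] taken from the pdf.js build in use.
function collectImageNames(operatorList, imageOps) {
  const names = [];
  operatorList.fnArray.forEach((fn, index) => {
    if (!imageOps.includes(fn)) return;
    // The first argument of an image operator is the image's object name.
    const name = operatorList.argsArray[index][0];
    // The same image can be painted several times; keep each name once.
    if (!names.includes(name)) names.push(name);
  });
  return names;
}
```

Each returned name can then be passed to `pdfPage.objs.get(...)` in the same way the test does with `imgArgs[0]`.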

// Assumed imports (not shown in the original snippet): `pdf` is the pdfjs-dist
// build for Node.js (the exact path depends on the installed version), and
// `ccc` is the node-canvas package, which provides createCanvas().
const fs = require('fs');
const ccc = require('canvas');
const pdf = require('pdfjs-dist/legacy/build/pdf.js');

// First, open the document
pdf.getDocument('solar.pdf').promise.then(async function (pdfObj) {
  // For this test, only page 7 is needed
  const page = await pdfObj.getPage(7);

  // The image information comes from the page's operator list
  const operators = await page.getOperatorList();

  // Collect the index of every paintImageXObject operator. Other image
  // operators exist (such as paintJpegImage) and presumably work the same way.
  const rawImgOperator = operators.fnArray
    .map((f, index) => (f === pdf.OPS.paintImageXObject ? index : null))
    .filter((n) => n !== null);

  // argsArray holds the arguments of each operator: the first argument is the
  // image's name (called "filename" here), followed by its width and height.
  // This example takes only the first image, because page 7 is known to
  // contain one. The array may be empty on other pages, so real code should
  // loop over it.
  const filename = operators.argsArray[rawImgOperator[0]][0];

  // Fetch the image object itself from page.objs using the filename
  page.objs.get(filename, (arg) => {
    // The object carries the width and height needed to size the canvas
    const canvas = ccc.createCanvas(arg.width, arg.height);
    const ctx = canvas.getContext('2d');

    // The original array may not hold RGBA data, but the canvas expects RGBA.
    // Checking the array size should tell the two apart: width * height * 3
    // means RGB, which must be converted; width * height * 4 means RGBA, which
    // can be used as is. This image was RGB, which is probably the most common
    // case, so the loop below converts RGB only.
    const data = new Uint8ClampedArray(arg.width * arg.height * 4);
    let k = 0;
    let i = 0;
    while (i < arg.data.length) {
      data[k] = arg.data[i]; // r
      data[k + 1] = arg.data[i + 1]; // g
      data[k + 2] = arg.data[i + 2]; // b
      data[k + 3] = 255; // a

      i += 3;
      k += 4;
    }

    // Create the image data and draw it onto the canvas
    const imgData = ctx.createImageData(arg.width, arg.height);
    imgData.data.set(data);
    ctx.putImageData(imgData, 0, 0);

    // Encode the canvas into a buffer
    const buff = canvas.toBuffer();

    // Write the file. The buffer is PNG-encoded and can be rather large;
    // an image conversion utility like sharp.js may give better results by
    // compressing it. writeFileSync is used because fs.writeFile requires a
    // callback.
    fs.writeFileSync('test.png', buff);
  });
});
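
The size check described in the comments above can be sketched as a standalone helper. This is only an illustration, assuming the raw data is either tightly packed RGB or RGBA; the name `toRgba` is hypothetical, and any other layout (for example 1-bit or grayscale data) is rejected and not guessed at.

```javascript
// Expand raw pixel data to RGBA so it can be copied into an ImageData object.
// The channel count is inferred from the buffer length: width * height * 4
// bytes is taken as RGBA, width * height * 3 bytes as RGB.
function toRgba(src, width, height) {
  const pixels = width * height;

  if (src.length === pixels * 4) {
    // Already RGBA: return a copy unchanged.
    return new Uint8ClampedArray(src);
  }
  if (src.length !== pixels * 3) {
    throw new Error(
      'Unsupported pixel layout: ' + src.length + ' bytes for ' + pixels + ' pixels'
    );
  }

  // RGB: copy the three colour channels and add an opaque alpha channel.
  const out = new Uint8ClampedArray(pixels * 4);
  for (let i = 0, k = 0; i < src.length; i += 3, k += 4) {
    out[k] = src[i]; // r
    out[k + 1] = src[i + 1]; // g
    out[k + 2] = src[i + 2]; // b
    out[k + 3] = 255; // a
  }
  return out;
}
```

In the snippet above, the conversion loop could then be replaced with `const data = toRgba(arg.data, arg.width, arg.height);`.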

Another way to extract images

Convert the page to SVG first, then extract the images from the SVG.

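// Recursively walk the SVG tree and log the source of every embedded image
function element_list(el, depth) {
  for (var i = 0; i < el.children.length; i++) {
    const element = el.children[i];
    // Images appear as elements whose nodeName is "svg:image"
    if (element.nodeName === 'svg:image') {
      // The xlink:href attribute holds the image source
      const getImage = element.getAttribute('xlink:href');
      console.log(getImage);
    }
    element_list(element, depth + 1);
  }
}

// `page` is a pdf.js page object and `viewport` is its viewport (from
// page.getViewport); both are assumed to have been obtained earlier.
page
  .getOperatorList()
  .then((opList) => {
    var svgGfx = new PDFJS.SVGGraphics(page.commonObjs, page.objs);
    return svgGfx.getSVG(opList, viewport);
  })
  .then((svg) => {
    element_list(svg, 0);
  });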
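
The code above only logs the `xlink:href` value. As a rough sketch of the remaining step, the helper below turns such a value into raw bytes, assuming the href is a base64 `data:` URL. pdf.js may emit `blob:` URLs instead, which this sketch skips. The name `decodeDataUrl` is hypothetical, and `Buffer` assumes a Node.js environment.

```javascript
// Decode a base64 data URL such as "data:image/png;base64,iVBORw0..." into
// its MIME type and raw bytes. Returns null for anything else (for example
// blob: URLs), so callers can skip images this sketch does not handle.
function decodeDataUrl(href) {
  const match = /^data:([^;,]+);base64,([\s\S]*)$/.exec(href || '');
  if (!match) return null;
  return {
    mimeType: match[1],
    // Buffer is a Node.js global; in a browser, atob() is the usual substitute.
    bytes: Buffer.from(match[2], 'base64'),
  };
}
```

Inside `element_list`, the result of `decodeDataUrl(getImage)` could be written to disk with `fs.writeFileSync`, choosing the file extension from `mimeType`.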


