lamoglia/pdf2html

pdf2html

pdf2html converts PDF files to HTML or plain text using Apache Tika. It can also generate a thumbnail image of a PDF file using Apache PDFBox.

Installation

via yarn:

yarn add pdf2html

via npm:

npm install --save pdf2html

A Java Runtime Environment (JRE) is required to run this module.
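
Because the module depends on a JRE, it can be useful to check for one before the first conversion. The following is a minimal sketch, not part of pdf2html: `isJavaAvailable` is a hypothetical helper name, and it assumes that the `java` executable is expected on the `PATH`.

```javascript
const { spawnSync } = require('child_process')

// Returns true if a `java` executable can be started from the current PATH.
// `javaCommand` is a parameter only so that a non-default binary can be checked.
function isJavaAvailable(javaCommand = 'java') {
    // `java -version` prints a version banner and exits with status 0
    const result = spawnSync(javaCommand, ['-version'], { stdio: 'ignore' })
    // `result.error` is set when the process could not be spawned (e.g. ENOENT)
    return !result.error && result.status === 0
}

if (!isJavaAvailable()) {
    console.error('Java was not found on PATH; pdf2html needs a JRE to run.')
}
```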

Usage

const pdf2html = require('pdf2html')

pdf2html.html('sample.pdf', (err, html) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(html)
    }
})
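
The functions shown in this README take a Node-style `(err, result)` callback. For code that prefers `async`/`await`, they can be wrapped with `util.promisify`. This is a sketch under that assumption: `promisifyPdf2html` is a hypothetical helper, not part of the module, and it takes the module as an argument so that it does not depend on how pdf2html is loaded.

```javascript
const { promisify } = require('util')

// Wraps the callback-style functions of a pdf2html-like module in promises.
// `lib` is expected to expose html/text/pages/meta/thumbnail, each taking a
// Node-style (err, result) callback as its last argument.
function promisifyPdf2html(lib) {
    const wrapped = {}
    for (const name of ['html', 'text', 'pages', 'meta', 'thumbnail']) {
        // bind() keeps `this` pointing at the module, in case it relies on it
        wrapped[name] = promisify(lib[name].bind(lib))
    }
    return wrapped
}
```

In an application this could be used as `const pdf = promisifyPdf2html(require('pdf2html'))`, followed by `const html = await pdf.html('sample.pdf')` inside an async function.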

Convert to text

pdf2html.text('sample.pdf', (err, text) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(text)
    }
})

Convert as pages

pdf2html.pages('sample.pdf', (err, htmlPages) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(htmlPages)
    }
})

// Pass { text: true } to receive each page as plain text instead of HTML
const pageOptions = { text: true }
pdf2html.pages('sample.pdf', pageOptions, (err, textPages) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(textPages)
    }
})
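
Once the pages are available, a common next step is to write each one to its own file. The sketch below assumes that the callback receives an array of strings, one per PDF page and in page order, which the examples above suggest but do not state. `savePages` and the `page-<n>` file naming are hypothetical, not part of pdf2html.

```javascript
const fs = require('fs')
const path = require('path')

// Writes each entry of `pages` to <outDir>/page-<n>.<ext> and returns the paths.
// Assumes `pages` is an array of strings, one per PDF page, in page order.
function savePages(pages, outDir, ext = 'html') {
    fs.mkdirSync(outDir, { recursive: true })
    return pages.map((content, index) => {
        const filePath = path.join(outDir, `page-${index + 1}.${ext}`)
        fs.writeFileSync(filePath, content, 'utf8')
        return filePath
    })
}
```

Inside the `pdf2html.pages` callback this could be called as `savePages(htmlPages, 'out')`, or as `savePages(textPages, 'out', 'txt')` for the plain-text variant.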

Extract metadata

pdf2html.meta('sample.pdf', (err, meta) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(meta)
    }
})

Generate thumbnail

pdf2html.thumbnail('sample.pdf', (err, thumbnailPath) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(thumbnailPath)
    }
})

// Optional settings: page number, image format, and output width and height
const thumbnailOptions = { page: 1, imageType: 'png', width: 160, height: 226 }
pdf2html.thumbnail('sample.pdf', thumbnailOptions, (err, thumbnailPath) => {
    if (err) {
        console.error('Conversion error: ' + err)
    } else {
        console.log(thumbnailPath)
    }
})
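
The 160 by 226 size used in the options example is close to the A4 portrait aspect ratio (210 by 297 mm). When a different width is needed, a matching height can be computed rather than guessed. This is only a sketch: `thumbnailSize` is a hypothetical helper, and the A4 default is an assumption about the source PDF, not something pdf2html checks or requires.

```javascript
// Computes a { width, height } pair that keeps a page's aspect ratio.
// Defaults assume an A4 portrait page (210 x 297 mm); pass other page
// dimensions for different paper sizes.
function thumbnailSize(width, pageWidth = 210, pageHeight = 297) {
    if (!(width > 0) || !(pageWidth > 0) || !(pageHeight > 0)) {
        throw new RangeError('width and page dimensions must be positive numbers')
    }
    return { width, height: Math.round((width * pageHeight) / pageWidth) }
}

console.log(thumbnailSize(160)) // { width: 160, height: 226 }
```

The result can be spread into the options object, for example `{ page: 1, imageType: 'png', ...thumbnailSize(320) }`.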

Manually download dependency files

Downloading the dependencies during installation can be slow, or can fail entirely behind an HTTP proxy. In that case, follow the steps below to download the files yourself so that the automatic download can be skipped.

cd node_modules/pdf2html/vendor
# These URLs come from https://github.com/shebinleo/pdf2html/blob/master/postinstall.js#L6-L7
wget https://dlcdn.apache.org/pdfbox/2.0.26/pdfbox-app-2.0.26.jar
wget https://dlcdn.apache.org/tika/2.4.0/tika-app-2.4.0.jar
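
After a manual download, a short script can confirm that both files are where the module's vendor directory is expected to be. This is a sketch, not part of pdf2html: `missingJars` is a hypothetical helper, the file names are taken from the `wget` commands above, and the relative path assumes the script runs from the project root.

```javascript
const fs = require('fs')
const path = require('path')

// JAR file names taken from the wget commands above
const REQUIRED_JARS = ['pdfbox-app-2.0.26.jar', 'tika-app-2.4.0.jar']

// Returns the names of required JARs that are missing from `vendorDir`.
function missingJars(vendorDir) {
    return REQUIRED_JARS.filter((name) => !fs.existsSync(path.join(vendorDir, name)))
}

// Assumes the current working directory is the project root
const vendorDir = path.join('node_modules', 'pdf2html', 'vendor')
const missing = missingJars(vendorDir)
console.log(missing.length === 0 ? 'All JARs present' : 'Missing: ' + missing.join(', '))
```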
