tesseract 5.2.0
pdfrenderer.cpp
// File: pdfrenderer.cpp
// Description: PDF rendering interface to inject into TessBaseAPI
//
// (C) Copyright 2011, Google Inc.
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

// Include automatically generated configuration file if running autoconf.
#ifdef HAVE_CONFIG_H
# include "config_auto.h"
#endif

#include "pdf_ttf.h"
#include "tprintf.h"

#include <allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <cmath>
#include <cstring>
#include <fstream>   // for std::ifstream
#include <locale>    // for std::locale::classic
#include <memory>    // std::unique_ptr
#include <sstream>   // for std::stringstream
#include "helpers.h" // for Swap

/*

Design notes from Ken Sharp, with light editing.

We think one solution is a font with a single glyph (.notdef) and a
CIDToGIDMap which maps all the CIDs to 0. That map would then be
stored as a stream in the PDF file, and when flate compressed should
be pretty small. The font, of course, will be approximately the same
size as the one you currently use.

I'm working on such a font now, the CIDToGIDMap is trivial, you just
create a stream object which contains 128k bytes (2 bytes per possible
CID and your CIDs range from 0 to 65535) and where you currently have
"/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".

Note that if, in future, you were to use a different (ie not 2 byte)
CMap for character codes you could trivially extend the CIDToGIDMap.

The following is an explanation of how some of the font stuff works,
this may be too simple for you in which case please accept my
apologies, it's hard to know how much knowledge someone has. You can
skip all this anyway, it's just for information.

The font embedded in a PDF file is usually intended just to be
rendered, but extensions allow for at least some ability to locate (or
copy) text from a document. This isn't something which was an original
goal of the PDF format, but it's been retro-fitted, presumably due to
popular demand.

To do this reliably the PDF file must contain a ToUnicode CMap, a
device for mapping character codes to Unicode code points. If one of
these is present, then this will be used to convert the character
codes into Unicode values. If it's not present then the reader will
fall back through a series of heuristics to try and guess the
result. This is, as you would expect, prone to failure.

This doesn't concern you of course, since you always write a ToUnicode
CMap, so because you are writing the text in text rendering mode 3 it
would seem that you don't really need to worry about this, but in the
PDF spec you cannot have an isolated ToUnicode CMap, it has to be
attached to a font, so in order to get even copy/paste to work you
need to define a font.

This is what leads to problems, tools like pdfwrite assume that they
are going to be able to (or even have to) modify the font entries, so
they require that the font being embedded be valid, and to be honest
the font Tesseract embeds isn't valid (for this purpose).


To see why let's look at how text is specified in a PDF file:

(Test) Tj

Now that looks like text but actually it isn't. Each of those bytes is
a 'character code'. When it comes to rendering the text a complex
sequence of events takes place, which converts the character code into
'something' which the font understands. It's entirely possible via
character mappings to have that text render as 'Sftu'

For simple fonts (PostScript type 1), we use the character code as the
index into an Encoding array (256 elements), each element of which is
a glyph name, so this gives us a glyph name. We then consult the
CharStrings dictionary in the font, that's a complex object which
contains pairs of keys and values, you can use the key to retrieve a
given value. So we have a glyph name, we then use that as the key to
the dictionary and retrieve the associated value. For a type 1 font,
the value is a glyph program that describes how to draw the glyph.
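
Editorial illustration, not part of Ken Sharp's notes: the simple-font
lookup described above can be sketched roughly as follows. SimpleFont,
MakeToyFont and GlyphNamesFor are hypothetical names invented for this
sketch, the "glyph programs" are placeholder strings, and the toy
Encoding is chosen only so that the codes of "Test" select the glyphs
S, f, t and u.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a simple font: a 256-entry Encoding array of
// glyph names plus a CharStrings dictionary keyed by glyph name.
struct SimpleFont {
  std::array<std::string, 256> encoding;           // character code -> glyph name
  std::map<std::string, std::string> char_strings; // glyph name -> glyph program
};

// Builds a toy font whose Encoding deliberately disagrees with ASCII.
SimpleFont MakeToyFont() {
  SimpleFont font;
  font.encoding.fill(".notdef");
  font.encoding['T'] = "S";
  font.encoding['e'] = "f";
  font.encoding['s'] = "t";
  font.encoding['t'] = "u";
  const char *names[] = {".notdef", "S", "f", "t", "u"};
  for (const char *name : names) {
    font.char_strings[name] = std::string("<outline of ") + name + ">";
  }
  return font;
}

// Returns the glyph names selected by the bytes of a Tj string.
std::string GlyphNamesFor(const SimpleFont &font, const std::string &codes) {
  std::string names;
  for (unsigned char code : codes) {
    const std::string &name = font.encoding[code];
    assert(font.char_strings.count(name) == 1); // every name needs a program
    names += name;
  }
  return names;
}
```

With this toy font, GlyphNamesFor(MakeToyFont(), "Test") yields the
glyph names S, f, t, u, which is the sense in which the bytes of
"(Test) Tj" are character codes rather than text.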

For CIDFonts, it's a little more complicated. Because CIDFonts can be
large, using a glyph name as the key is unreasonable (it would also
lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
as the key. CIDs are just numbers.

But.... We don't use the character code as the CID. What we do is use
a CMap to convert the character code into a CID. We then use the CID
to key the CharStrings dictionary and proceed as before. So the 'CMap'
is the equivalent of the Encoding array, but it's a more compact and
flexible representation.

Note that you have to use the CMap just to find out how many bytes
constitute a character code, and it can be variable. For example you
can say if the first byte is 0x00->0x7f then it's just one byte, if it's
0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
have seen CMaps defining character codes up to 5 bytes wide.
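
Editorial illustration, not part of Ken Sharp's notes: a minimal sketch
of splitting a string into variable-width character codes using the
example ranges above. CodeLength and SplitCodes are hypothetical names.
The quoted ranges overlap at 0xf0; this sketch assumes 0xf0 starts a
three-byte code. A real CMap declares its ranges in codespacerange
sections, which this sketch does not parse.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the character code that starts with `first`.
// Assumption: 0x00-0x7F -> 1 byte, 0x80-0xEF -> 2 bytes, 0xF0-0xFF -> 3 bytes.
int CodeLength(unsigned char first) {
  if (first <= 0x7F) return 1;
  if (first < 0xF0) return 2;
  return 3;
}

// Splits a content-stream string into character codes. Returns an empty
// vector if the string ends in the middle of a code.
std::vector<unsigned long> SplitCodes(const std::string &bytes) {
  std::vector<unsigned long> codes;
  std::size_t i = 0;
  while (i < bytes.size()) {
    const int n = CodeLength(static_cast<unsigned char>(bytes[i]));
    assert(n >= 1 && n <= 3);
    if (i + n > bytes.size()) return {};
    unsigned long code = 0;
    for (int k = 0; k < n; ++k) {
      code = (code << 8) | static_cast<unsigned char>(bytes[i + k]);
    }
    codes.push_back(code);
    i += n;
  }
  return codes;
}
```

For the six bytes 41 81 02 F1 03 04 this produces three codes: 0x41,
0x8102 and 0xF10304.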

Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
TrueType CIDFonts. The thing is that TrueType fonts are accessed using
a Glyph ID (GID) (and the LOCA table) which may well not be anything
like the CID. So for this case PDF includes a CIDToGIDMap. That maps
the CIDs to GIDs, and we can then use the GID to get the glyph
description from the GLYF table of the font.

So for a TrueType CIDFont, character-code->CID->GID->glyf-program.

Looking at the PDF file I was supplied with we see that it contains
text like :

<0x0075> Tj

So we start by taking the character code (117) and look it up in the
CMap. Well you don't supply a CMap, you just use the Identity-H one
which is predefined. So character code 117 maps to CID 117. Then we
use the CIDToGIDMap, again you don't supply one, you just use the
predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
were supplied with only contains 116 glyphs.
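
Editorial illustration, not part of Ken Sharp's notes: the two identity
mappings above, written out as a sketch. The function names are
hypothetical; the only facts taken from the notes are that Identity-H
and the Identity CIDToGIDMap pass the value through unchanged and that
the font has 116 glyphs.

```cpp
#include <cassert>
#include <cstdint>

// Identity-H: a two-byte character code is used directly as the CID.
std::uint16_t IdentityHCodeToCid(std::uint16_t code) { return code; }

// /CIDToGIDMap /Identity: the CID is used directly as the glyph index.
std::uint16_t IdentityCidToGid(std::uint16_t cid) { return cid; }

// True if the glyph index exists in a font with `num_glyphs` glyphs
// (valid GIDs are 0 .. num_glyphs - 1).
bool GlyphExists(std::uint16_t gid, int num_glyphs) {
  assert(num_glyphs > 0);
  return gid < num_glyphs;
}
```

Character code 0x0075 goes through both mappings unchanged and arrives
as GID 117, for which GlyphExists(117, 116) is false.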

Now for Latin that's not a huge problem, you can just supply a bigger
font. But for more complex languages that *is* going to be more of a
problem. Either you need to supply a font which contains glyphs for
all the possible CID->GID mappings, or we need to think laterally.

Our solution using a TrueType CIDFont is to intervene at the
CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
font with just one glyph, the .notdef glyph at GID 0. This is what I'm
looking into now.

It would also be possible to have a 'PostScript' (ie type 1 outlines)
CIDFont which contained 1 glyph, and a CMap which mapped all character
codes to CID 0. The effect would be the same.

It's possible (I haven't checked) that the PostScript CIDFont and
associated CMap would be smaller than the TrueType font and associated
CIDToGIDMap.

--- in a followup ---

OK there is a small problem there, if I use GID 0 then Acrobat gets
upset about it and complains it cannot extract the font. If I set the
CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
mad......
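
Editorial illustration, not part of Ken Sharp's notes: a sketch of the
uncompressed CIDToGIDMap stream described above, with every CID mapped
to the same GID. BuildConstantCidToGidMap and LookupGid are
hypothetical names. The sketch assumes two bytes per CID with the high
byte first, and leaves out the flate compression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds an uncompressed CIDToGIDMap stream: one big-endian 16-bit GID per
// CID, for CIDs 0..65535, with every CID pointing at the same glyph.
std::vector<unsigned char> BuildConstantCidToGidMap(std::uint16_t gid) {
  const std::size_t kNumCids = 1u << 16;
  std::vector<unsigned char> map(2 * kNumCids);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(gid >> 8);       // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF); // low byte
  }
  return map;
}

// Looks up the GID for a CID in such a stream.
std::uint16_t LookupGid(const std::vector<unsigned char> &map, std::uint16_t cid) {
  assert(map.size() >= 2 * (static_cast<std::size_t>(cid) + 1));
  return static_cast<std::uint16_t>((map[2 * cid] << 8) | map[2 * cid + 1]);
}
```

BuildConstantCidToGidMap(1) gives 131072 bytes (the "128k bytes"
mentioned earlier) alternating 00 01, so every CID, including 117,
resolves to GID 1.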

*/

namespace tesseract {

// If the font is 10 pts, nominal character width is 5 pts
static const int kCharWidth = 2;

// Used for memory allocation. A codepoint must take no more than this
// many bytes, when written in the PDF way. e.g. "<0063>" for the
// letter 'c'
static const int kMaxBytesPerCodepoint = 20;

/**********************************************************************
 * PDF Renderer interface implementation
 **********************************************************************/
TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir, bool textonly)
    : TessResultRenderer(outputbase, "pdf"), datadir_(datadir) {
  obj_ = 0;
  textonly_ = textonly;
  offsets_.push_back(0);
}

void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
  offsets_.push_back(objectsize + offsets_.back());
  obj_++;
}

void TessPDFRenderer::AppendPDFObject(const char *data) {
  AppendPDFObjectDIY(strlen(data));
  AppendString(data);
}

// Helper function to prevent us from accidentally writing
// scientific notation to an HOCR or PDF file. Besides, three
// decimal places are all you really need.
static double prec(double x) {
  double kPrecision = 1000.0;
  double a = round(x * kPrecision) / kPrecision;
  if (a == -0) {
    return 0;
  }
  return a;
}

static long dist2(int x1, int y1, int x2, int y2) {
  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
}

// Viewers like evince can get really confused during copy-paste when
// the baseline wanders around. So I've decided to project every word
// onto the (straight) line baseline. All numbers are in the native
// PDF coordinate system, which has the origin in the bottom left and
// the unit is points, which is 1/72 inch. Tesseract reports baselines
// left-to-right no matter what the reading order is. We need the
// word baseline in reading order, so we do that conversion here. Returns
// the word's baseline origin and length.
static void GetWordBaseline(int writing_direction, int ppi, int height, int word_x1, int word_y1,
                            int word_x2, int word_y2, int line_x1, int line_y1, int line_x2,
                            int line_y2, double *x0, double *y0, double *length) {
  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
    std::swap(word_x1, word_x2);
    std::swap(word_y1, word_y2);
  }
  double word_length;
  double x, y;
  {
    int px = word_x1;
    int py = word_y1;
    double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
    if (l2 == 0) {
      x = line_x1;
      y = line_y1;
    } else {
      double t = ((px - line_x2) * (line_x2 - line_x1) + (py - line_y2) * (line_y2 - line_y1)) / l2;
      x = line_x2 + t * (line_x2 - line_x1);
      y = line_y2 + t * (line_y2 - line_y1);
    }
    word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1, word_x2, word_y2)));
    word_length = word_length * 72.0 / ppi;
    x = x * 72 / ppi;
    y = height - (y * 72.0 / ppi);
  }
  *x0 = x;
  *y0 = y;
  *length = word_length;
}
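
/*
Editorial illustration, not part of the Tesseract sources: the
projection used by GetWordBaseline, isolated as a small sketch. Point,
ProjectOntoLine and PixelsToPoints are hypothetical names. The sketch
works in doubles and parameterizes the line from its first endpoint,
whereas the function above parameterizes from the second; both give the
same orthogonal projection.

```cpp
#include <cassert>
#include <cmath>

struct Point {
  double x;
  double y;
};

// Orthogonal projection of p onto the infinite line through a and b.
// If a and b coincide, the line is degenerate and a is returned.
Point ProjectOntoLine(Point p, Point a, Point b) {
  const double dx = b.x - a.x;
  const double dy = b.y - a.y;
  const double l2 = dx * dx + dy * dy;
  if (l2 == 0.0) return a;
  const double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / l2;
  return {a.x + t * dx, a.y + t * dy};
}

// Converts a length in pixels to PDF points (1/72 inch).
double PixelsToPoints(double pixels, double ppi) {
  assert(ppi > 0);
  return pixels * 72.0 / ppi;
}
```

Projecting (5, 3) onto the line through (0, 0) and (10, 0) gives
(5, 0), and 300 pixels at 300 ppi is 72 points.
*/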

// Compute coefficients for an affine matrix describing the rotation
// of the text. If the text is right-to-left such as Arabic or Hebrew,
// we reflect over the Y-axis. This matrix will set the coordinate
// system for placing text in the PDF file.
//
//                           RTL
// [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
// [ y' ]   [ c d ][ y ]   [ 0 1 ] [-sin cos ][ y ]
static void AffineMatrix(int writing_direction, int line_x1, int line_y1, int line_x2, int line_y2,
                         double *a, double *b, double *c, double *d) {
  double theta =
      atan2(static_cast<double>(line_y1 - line_y2), static_cast<double>(line_x2 - line_x1));
  *a = cos(theta);
  *b = sin(theta);
  *c = -sin(theta);
  *d = cos(theta);
  switch (writing_direction) {
    case WRITING_DIRECTION_RIGHT_TO_LEFT:
      *a = -*a;
      *b = -*b;
      break;
    case WRITING_DIRECTION_TOP_TO_BOTTOM:
      // TODO(jbreiden) Consider using the vertical PDF writing mode.
      break;
    default:
      break;
  }
}

// There are some really awkward PDF viewers in the wild, such as
// 'Preview' which ships with the Mac. They do a better job with text
// selection and highlighting when given a perfectly flat baseline
// instead of a very slightly tilted one. We clip small tilts to appease
// these viewers. I chose this threshold large enough to absorb noise,
// but small enough that lines probably won't cross each other if the
// whole page is tilted at almost exactly the clipping threshold.
static void ClipBaseline(int ppi, int x1, int y1, int x2, int y2, int *line_x1, int *line_y1,
                         int *line_x2, int *line_y2) {
  *line_x1 = x1;
  *line_y1 = y1;
  *line_x2 = x2;
  *line_y2 = y2;
  int rise = abs(y2 - y1) * 72;
  int run = abs(x2 - x1) * 72;
  if (rise < 2 * ppi && 2 * ppi < run) {
    *line_y1 = *line_y2 = (y1 + y2) / 2;
  }
}

static bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    tprintf("Dropping invalid codepoint %d\n", code);
    return false;
  }
  if (code < 0x10000) {
    snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
  } else {
    int a = code - 0x010000;
    int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
    int low_surrogate = (0x03FF & a) + 0xDC00;
    snprintf(utf16, kMaxBytesPerCodepoint, "%04X%04X", high_surrogate, low_surrogate);
  }
  return true;
}
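
/*
Editorial illustration, not part of the Tesseract sources: a
standalone sketch of the same UTF-16BE hex encoding, returning a
std::string so it can be tried outside the renderer. Utf16BeHex is a
hypothetical name, and unlike CodepointToUtf16be it reports invalid
input with an empty string instead of logging.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns the UTF-16BE encoding of a Unicode code point as uppercase hex
// digits, or an empty string for surrogates and values above U+10FFFF.
std::string Utf16BeHex(unsigned long code) {
  if ((code >= 0xD800 && code <= 0xDFFF) || code > 0x10FFFF) return "";
  char buf[9]; // up to two 4-digit units plus the terminating NUL
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04lX", code);
  } else {
    const unsigned long v = code - 0x10000;         // 20 significant bits
    const unsigned long high = 0xD800 + (v >> 10);  // top 10 bits
    const unsigned long low = 0xDC00 + (v & 0x3FF); // bottom 10 bits
    assert(high <= 0xDBFF && low <= 0xDFFF);
    std::snprintf(buf, sizeof(buf), "%04lX%04lX", high, low);
  }
  return buf;
}
```

Utf16BeHex(0x63) gives "0063", the form quoted next to
kMaxBytesPerCodepoint, and Utf16BeHex(0x1F600) gives the surrogate
pair "D83DDE00".
*/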

char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double height) {
  double ppi = api->GetSourceYResolution();

  // These initial conditions are all arbitrary and will be overwritten
  double old_x = 0.0, old_y = 0.0;
  int old_fontsize = 0;
  tesseract::WritingDirection old_writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
  bool new_block = true;
  int fontsize = 0;
  double a = 1;
  double b = 0;
  double c = 0;
  double d = 1;

  std::stringstream pdf_str;
  // Use "C" locale (needed for double values prec()).
  pdf_str.imbue(std::locale::classic());
  // Use 8 digits for double values.
  pdf_str.precision(8);

  // TODO(jbreiden) This marries the text and image together.
  // Slightly cleaner from an abstraction standpoint if this were to
  // live inside a separate text object.
  pdf_str << "q " << prec(width) << " 0 0 " << prec(height) << " 0 0 cm";
  if (!textonly_) {
    pdf_str << " /Im1 Do";
  }
  pdf_str << " Q\n";

  int line_x1 = 0;
  int line_y1 = 0;
  int line_x2 = 0;
  int line_y2 = 0;

  const std::unique_ptr</*non-const*/ ResultIterator> res_it(api->GetIterator());
  while (!res_it->Empty(RIL_BLOCK)) {
    if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
      pdf_str << "BT\n3 Tr"; // Begin text object, use invisible ink
      old_fontsize = 0;      // Every block will declare its fontsize
      new_block = true;      // Every block will declare its affine matrix
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      int x1, y1, x2, y2;
      res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
      ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
    }

    if (res_it->Empty(RIL_WORD)) {
      res_it->Next(RIL_WORD);
      continue;
    }

    // Writing direction changes at a per-word granularity
    tesseract::WritingDirection writing_direction;
    {
      tesseract::Orientation orientation;
      tesseract::TextlineOrder textline_order;
      float deskew_angle;
      res_it->Orientation(&orientation, &writing_direction, &textline_order, &deskew_angle);
      if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
        switch (res_it->WordDirection()) {
          case DIR_LEFT_TO_RIGHT:
            writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
            break;
          case DIR_RIGHT_TO_LEFT:
            writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
            break;
          default:
            writing_direction = old_writing_direction;
        }
      }
    }

    // Where is word origin and how long is it?
    double x, y, word_length;
    {
      int word_x1, word_y1, word_x2, word_y2;
      res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
      GetWordBaseline(writing_direction, ppi, height, word_x1, word_y1, word_x2, word_y2, line_x1,
                      line_y1, line_x2, line_y2, &x, &y, &word_length);
    }

    if (writing_direction != old_writing_direction || new_block) {
      AffineMatrix(writing_direction, line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
      pdf_str << " " << prec(a) // . This affine matrix
              << " " << prec(b) // . sets the coordinate
              << " " << prec(c) // . system for all
              << " " << prec(d) // . text that follows.
              << " " << prec(x) // .
              << " " << prec(y) // .
              << (" Tm ");      // Place cursor absolutely
      new_block = false;
    } else {
      double dx = x - old_x;
      double dy = y - old_y;
      pdf_str << " " << prec(dx * a + dy * b) << " " << prec(dx * c + dy * d)
              << (" Td "); // Relative moveto
    }
    old_x = x;
    old_y = y;
    old_writing_direction = writing_direction;

    // Adjust font size on a per word granularity. Pay attention to
    // fontsize, old_fontsize, and pdf_str. We've found that
    // in Arabic, Tesseract will happily return a fontsize of zero,
    // so we make up a default number to protect ourselves.
    {
      bool bold, italic, underlined, monospace, serif, smallcaps;
      int font_id;
      res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace, &serif, &smallcaps,
                                 &fontsize, &font_id);
      const int kDefaultFontsize = 8;
      if (fontsize <= 0) {
        fontsize = kDefaultFontsize;
      }
      if (fontsize != old_fontsize) {
        pdf_str << "/f-0-0 " << fontsize << " Tf ";
        old_fontsize = fontsize;
      }
    }

    bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
    bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
    std::string pdf_word;
    int pdf_word_len = 0;
    do {
      const std::unique_ptr<const char[]> grapheme(res_it->GetUTF8Text(RIL_SYMBOL));
      if (grapheme && grapheme[0] != '\0') {
        std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
        char utf16[kMaxBytesPerCodepoint];
        for (char32 code : unicodes) {
          if (CodepointToUtf16be(code, utf16)) {
            pdf_word += utf16;
            pdf_word_len++;
          }
        }
      }
      res_it->Next(RIL_SYMBOL);
    } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
    if (res_it->IsAtBeginningOf(RIL_WORD)) {
      pdf_word += "0020";
      pdf_word_len++;
    }
    if (word_length > 0 && pdf_word_len > 0) {
      double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
      pdf_str << h_stretch << " Tz" // horizontal stretch
              << " [ <" << pdf_word // UTF-16BE representation
              << "> ] TJ";          // show the text
    }
    if (last_word_in_line) {
      pdf_str << " \n";
    }
    if (last_word_in_block) {
      pdf_str << "ET\n"; // end the text object
    }
  }
  const std::string &text = pdf_str.str();
  char *result = new char[text.length() + 1];
  strcpy(result, text.c_str());
  return result;
}

bool TessPDFRenderer::BeginDocumentHandler() {
  AppendPDFObject("%PDF-1.5\n%\xDE\xAD\xBE\xEB\n");

  // CATALOG
  AppendPDFObject(
      "1 0 obj\n"
      "<<\n"
      " /Type /Catalog\n"
      " /Pages 2 0 R\n"
      ">>\nendobj\n");

  // We are reserving object #2 for the /Pages
  // object, which I am going to create and write
  // at the end of the PDF file.
  AppendPDFObject("");

  // TYPE0 FONT
  AppendPDFObject(
      "3 0 obj\n"
      "<<\n"
      " /BaseFont /GlyphLessFont\n"
      " /DescendantFonts [ 4 0 R ]\n" // CIDFontType2 font
      " /Encoding /Identity-H\n"
      " /Subtype /Type0\n"
      " /ToUnicode 6 0 R\n" // ToUnicode
      " /Type /Font\n"
      ">>\n"
      "endobj\n");

  // CIDFONTTYPE2
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream << "4 0 obj\n"
            "<<\n"
            " /BaseFont /GlyphLessFont\n"
            " /CIDToGIDMap 5 0 R\n" // CIDToGIDMap
            " /CIDSystemInfo\n"
            " <<\n"
            " /Ordering (Identity)\n"
            " /Registry (Adobe)\n"
            " /Supplement 0\n"
            " >>\n"
            " /FontDescriptor 7 0 R\n" // Font descriptor
            " /Subtype /CIDFontType2\n"
            " /Type /Font\n"
            " /DW "
         << (1000 / kCharWidth)
         << "\n"
            ">>\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // CIDTOGIDMAP
  const int kCIDToGIDMapSize = 2 * (1 << 16);
  const std::unique_ptr<unsigned char[]> cidtogidmap(new unsigned char[kCIDToGIDMapSize]);
  for (int i = 0; i < kCIDToGIDMapSize; i++) {
    cidtogidmap[i] = (i % 2) ? 1 : 0;
  }
  size_t len;
  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
  stream.str("");
  stream << "5 0 obj\n"
            "<<\n"
            " /Length "
         << len
         << " /Filter /FlateDecode\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp), len);
  objsize += len;
  lept_free(comp);
  const char *endstream_endobj =
      "endstream\n"
      "endobj\n";
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);

  const char stream2[] =
      "/CIDInit /ProcSet findresource begin\n"
      "12 dict begin\n"
      "begincmap\n"
      "/CIDSystemInfo\n"
      "<<\n"
      " /Registry (Adobe)\n"
      " /Ordering (UCS)\n"
      " /Supplement 0\n"
      ">> def\n"
      "/CMapName /Adobe-Identify-UCS def\n"
      "/CMapType 2 def\n"
      "1 begincodespacerange\n"
      "<0000> <FFFF>\n"
      "endcodespacerange\n"
      "1 beginbfrange\n"
      "<0000> <FFFF> <0000>\n"
      "endbfrange\n"
      "endcmap\n"
      "CMapName currentdict /CMap defineresource pop\n"
      "end\n"
      "end\n";

  // TOUNICODE
  stream.str("");
  stream << "6 0 obj\n"
            "<< /Length "
         << (sizeof(stream2) - 1)
         << " >>\n"
            "stream\n"
         << stream2
         << "endstream\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  // FONT DESCRIPTOR
  stream.str("");
  stream << "7 0 obj\n"
            "<<\n"
            " /Ascent 1000\n"
            " /CapHeight 1000\n"
            " /Descent -1\n" // Spec says must be negative
            " /Flags 5\n"    // FixedPitch + Symbolic
            " /FontBBox [ 0 0 "
         << (1000 / kCharWidth)
         << " 1000 ]\n"
            " /FontFile2 8 0 R\n"
            " /FontName /GlyphLessFont\n"
            " /ItalicAngle 0\n"
            " /StemV 80\n"
            " /Type /FontDescriptor\n"
            ">>\n"
            "endobj\n";
  AppendPDFObject(stream.str().c_str());

  stream.str("");
  stream << datadir_.c_str() << "/pdf.ttf";
  const uint8_t *font;
  std::ifstream input(stream.str().c_str(), std::ios::in | std::ios::binary);
  std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});
  auto size = buffer.size();
  if (size) {
    font = buffer.data();
  } else {
#if !defined(NDEBUG)
    tprintf("Cannot open file \"%s\"!\nUsing internal glyphless font.\n", stream.str().c_str());
#endif
    font = pdf_ttf;
    size = sizeof(pdf_ttf);
  }

  // FONTFILE2
  stream.str("");
  stream << "8 0 obj\n"
            "<<\n"
            " /Length "
         << size
         << "\n"
            " /Length1 "
         << size
         << "\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  objsize = stream.str().size();
  AppendData(reinterpret_cast<const char *>(font), size);
  objsize += size;
  AppendString(endstream_endobj);
  objsize += strlen(endstream_endobj);
  AppendPDFObjectDIY(objsize);
  return true;
}

bool TessPDFRenderer::imageToPDFObj(Pix *pix, const char *filename, long int objnum,
                                    char **pdf_object, long int *pdf_object_size,
                                    const int jpg_quality) {
  if (!pdf_object_size || !pdf_object) {
    return false;
  }
  *pdf_object = nullptr;
  *pdf_object_size = 0;
  if (!filename && !pix) {
    return false;
  }

  L_Compressed_Data *cid = nullptr;

  int sad = 0;
  if (pixGetInputFormat(pix) == IFF_PNG) {
    sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
  }
  if (!cid) {
    sad = l_generateCIDataForPdf(filename, pix, jpg_quality, &cid);
  }

  if (sad || !cid) {
    l_CIDataDestroy(&cid);
    return false;
  }

  const char *group4 = "";
  const char *filter;
  switch (cid->type) {
    case L_FLATE_ENCODE:
      filter = "/FlateDecode";
      break;
    case L_JPEG_ENCODE:
      filter = "/DCTDecode";
      break;
    case L_G4_ENCODE:
      filter = "/CCITTFaxDecode";
      group4 = " /K -1\n";
      break;
    case L_JP2K_ENCODE:
      filter = "/JPXDecode";
      break;
    default:
      l_CIDataDestroy(&cid);
      return false;
  }

  // Maybe someday we will accept RGBA but today is not that day.
  // It requires creating an /SMask for the alpha channel.
  // http://stackoverflow.com/questions/14220221
  std::stringstream colorspace;
  // Use "C" locale (needed for int values larger than 999).
  colorspace.imbue(std::locale::classic());
  if (cid->ncolors > 0) {
    colorspace << " /ColorSpace [ /Indexed /DeviceRGB " << (cid->ncolors - 1) << " "
               << cid->cmapdatahex << " ]\n";
  } else {
    switch (cid->spp) {
      case 1:
        if (cid->bps == 1 && pixGetInputFormat(pix) == IFF_PNG) {
          colorspace.str(
              " /ColorSpace /DeviceGray\n"
              " /Decode [1 0]\n");
        } else {
          colorspace.str(" /ColorSpace /DeviceGray\n");
        }
        break;
      case 3:
        colorspace.str(" /ColorSpace /DeviceRGB\n");
        break;
      default:
        l_CIDataDestroy(&cid);
        return false;
    }
  }

  int predictor = (cid->predictor) ? 14 : 1;

  // IMAGE
  std::stringstream b1;
  // Use "C" locale (needed for int values larger than 999).
  b1.imbue(std::locale::classic());
  b1 << objnum
     << " 0 obj\n"
        "<<\n"
        " /Length "
     << cid->nbytescomp
     << "\n"
        " /Subtype /Image\n";

  std::stringstream b2;
  // Use "C" locale (needed for int values larger than 999).
  b2.imbue(std::locale::classic());
  b2 << " /Width " << cid->w
     << "\n"
        " /Height "
     << cid->h
     << "\n"
        " /BitsPerComponent "
     << cid->bps
     << "\n"
        " /Filter "
     << filter
     << "\n"
        " /DecodeParms\n"
        " <<\n"
        " /Predictor "
     << predictor
     << "\n"
        " /Colors "
     << cid->spp << "\n"
     << group4 << " /Columns " << cid->w
     << "\n"
        " /BitsPerComponent "
     << cid->bps
     << "\n"
        " >>\n"
        ">>\n"
        "stream\n";

  const char *b3 =
      "endstream\n"
      "endobj\n";

  size_t b1_len = b1.str().size();
  size_t b2_len = b2.str().size();
  size_t b3_len = strlen(b3);
  size_t colorspace_len = colorspace.str().size();

  *pdf_object_size = b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
  *pdf_object = new char[*pdf_object_size];

  char *p = *pdf_object;
  memcpy(p, b1.str().c_str(), b1_len);
  p += b1_len;
  memcpy(p, colorspace.str().c_str(), colorspace_len);
  p += colorspace_len;
  memcpy(p, b2.str().c_str(), b2_len);
  p += b2_len;
  memcpy(p, cid->datacomp, cid->nbytescomp);
  p += cid->nbytescomp;
  memcpy(p, b3, b3_len);
  l_CIDataDestroy(&cid);
  return true;
}

bool TessPDFRenderer::AddImageHandler(TessBaseAPI *api) {
  Pix *pix = api->GetInputImage();
  const char *filename = api->GetInputName();
  int ppi = api->GetSourceYResolution();
  if (!pix || ppi <= 0) {
    return false;
  }
  double width = pixGetWidth(pix) * 72.0 / ppi;
  double height = pixGetHeight(pix) * 72.0 / ppi;

  std::stringstream xobject;
  // Use "C" locale (needed for int values larger than 999).
  xobject.imbue(std::locale::classic());
  if (!textonly_) {
    xobject << "/XObject << /Im1 " << (obj_ + 2) << " 0 R >>\n";
  }

  // PAGE
  std::stringstream stream;
  // Use "C" locale (needed for double values width and height).
  stream.imbue(std::locale::classic());
  stream.precision(2);
  stream << std::fixed << obj_
         << " 0 obj\n"
            "<<\n"
            " /Type /Page\n"
            " /Parent 2 0 R\n" // Pages object
            " /MediaBox [0 0 "
         << width << " " << height
         << "]\n"
            " /Contents "
         << (obj_ + 1)
         << " 0 R\n" // Contents object
            " /Resources\n"
            " <<\n"
            " "
         << xobject.str() << // Image object
      " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
      " /Font << /f-0-0 3 0 R >>\n" // Type0 Font
      " >>\n"
      ">>\n"
      "endobj\n";
  pages_.push_back(obj_);
  AppendPDFObject(stream.str().c_str());

  // CONTENTS
  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
  const size_t pdftext_len = strlen(pdftext.get());
  size_t len;
  unsigned char *comp_pdftext =
      zlibCompress(reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
  long comp_pdftext_len = len;
  stream.str("");
  stream << obj_
         << " 0 obj\n"
            "<<\n"
            " /Length "
         << comp_pdftext_len
         << " /Filter /FlateDecode\n"
            ">>\n"
            "stream\n";
  AppendString(stream.str().c_str());
  long objsize = stream.str().size();
  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
  objsize += comp_pdftext_len;
  lept_free(comp_pdftext);
  const char *b2 =
      "endstream\n"
      "endobj\n";
  AppendString(b2);
  objsize += strlen(b2);
  AppendPDFObjectDIY(objsize);

  if (!textonly_) {
    char *pdf_object = nullptr;
    int jpg_quality;
    api->GetIntVariable("jpg_quality", &jpg_quality);
    if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize, jpg_quality)) {
      return false;
    }
    AppendData(pdf_object, objsize);
    AppendPDFObjectDIY(objsize);
    delete[] pdf_object;
  }
  return true;
}

bool TessPDFRenderer::EndDocumentHandler() {
  // We reserved the /Pages object number early, so that the /Page
  // objects could refer to their parent. We finally have enough
  // information to go fill it in. Using lower level calls to manipulate
  // the offset record in two spots, because we are placing objects
  // out of order in the file.

  // PAGES
  const long int kPagesObjectNumber = 2;
  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
  std::stringstream stream;
  // Use "C" locale (needed for int values larger than 999).
  stream.imbue(std::locale::classic());
  stream << kPagesObjectNumber << " 0 obj\n<<\n /Type /Pages\n /Kids [ ";
  AppendString(stream.str().c_str());
  size_t pages_objsize = stream.str().size();
  for (const auto &page : pages_) {
    stream.str("");
    stream << page << " 0 R ";
    AppendString(stream.str().c_str());
    pages_objsize += stream.str().size();
  }
  stream.str("");
  stream << "]\n /Count " << pages_.size() << "\n>>\nendobj\n";
  AppendString(stream.str().c_str());
  pages_objsize += stream.str().size();
  offsets_.back() += pages_objsize; // manipulation #2

  // INFO
  std::string utf16_title = "FEFF"; // byte_order_marker
  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
  char utf16[kMaxBytesPerCodepoint];
  for (char32 code : unicodes) {
    if (CodepointToUtf16be(code, utf16)) {
      utf16_title += utf16;
    }
  }

  char *datestr = l_getFormattedDate();
  stream.str("");
  stream << obj_
         << " 0 obj\n"
            "<<\n"
            " /Producer (Tesseract "
         << tesseract::TessBaseAPI::Version()
         << ")\n"
            " /CreationDate (D:"
         << datestr
         << ")\n"
            " /Title <"
         << utf16_title.c_str()
         << ">\n"
            ">>\n"
            "endobj\n";
  lept_free(datestr);
  AppendPDFObject(stream.str().c_str());
  stream.str("");
  stream << "xref\n0 " << obj_ << "\n0000000000 65535 f \n";
  AppendString(stream.str().c_str());
  for (int i = 1; i < obj_; i++) {
    stream.str("");
    stream.width(10);
    stream.fill('0');
    stream << offsets_[i] << " 00000 n \n";
    AppendString(stream.str().c_str());
  }
  stream.str("");
  stream << "trailer\n<<\n /Size " << obj_
         << "\n"
            " /Root 1 0 R\n" // catalog
            " /Info "
         << (obj_ - 1)
         << " 0 R\n" // info
            ">>\nstartxref\n"
         << offsets_.back() << "\n%%EOF\n";
  AppendString(stream.str().c_str());
  return true;
}
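
/*
Editorial illustration, not part of the Tesseract sources: the
cross-reference entries written by the loop in EndDocumentHandler have
a fixed width, which the following sketch makes explicit. XrefEntry is
a hypothetical helper; it uses std::setw and std::setfill where the
code above calls stream.width() and stream.fill().

```cpp
#include <cassert>
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Formats one in-use cross-reference entry: a 10-digit byte offset, a
// 5-digit generation number, the keyword "n", and a two-byte end-of-line
// (space + newline here), for 20 bytes in total.
std::string XrefEntry(unsigned long long offset) {
  assert(offset <= 9999999999ULL); // must fit in 10 decimal digits
  std::ostringstream out;
  out.imbue(std::locale::classic()); // no locale-specific digit grouping
  out << std::setw(10) << std::setfill('0') << offset << " 00000 n \n";
  return out.str();
}
```

XrefEntry(15) returns "0000000015 00000 n " followed by a newline,
exactly 20 bytes.
*/
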
} // namespace tesseract