docs/documents.rst
79 additions & 3 deletions
Original file line number
Diff line number
Diff line change
@@ -208,6 +208,10 @@ Document
208
208
.. attribute:: contributor_organization
209
209
210
210
The organizational affiliation of the user who originally uploaded the document.
211
+
212
+
.. attribute:: contributor_organization_slug
213
+
214
+
The slug (URL-friendly identifier) of the organization to which the user who originally uploaded the document belongs.
211
215
212
216
.. attribute:: created_at
213
217
@@ -225,7 +229,6 @@ Document
225
229
226
230
Keys must be strings and only contain alphanumeric characters.
227
231
228
-
229
232
.. attribute:: description
230
233
231
234
A summary of the document. Can be edited and saved with a put command.
@@ -274,6 +277,9 @@ Document
274
277
>>> client.documents.get(new.id).get_errors()
275
278
[{'id': 96136, 'created_at': datetime.datetime(2023, 8, 30, 16, 28, 8, 594859), 'message': '404 Client Error: Not Found for url: https://www.launchcamden.com/wp-content/uploads/2023/08/7.13.23_01002.pdf'}]
276
279
280
+
.. method:: get_json_text()
281
+
282
+
Returns the full text of the document in a custom JSON format, indexed by page. Also available through the shorthand ``json_text``. Useful for comparing text across the document without making a separate API call for each page. Consult the full API documentation for more details.
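A minimal sketch of searching the page-indexed text locally. It assumes the JSON has a top-level ``pages`` list whose entries carry ``page`` and ``contents`` keys. That layout, the helper name ``pages_mentioning``, and the sample data are assumptions for illustration, so check the full API documentation for the actual format.

```python
def pages_mentioning(json_text, term):
    """Return the page indexes whose text contains `term` (case-insensitive)."""
    needle = term.lower()
    return [
        entry["page"]
        for entry in json_text.get("pages", [])
        if needle in entry.get("contents", "").lower()
    ]


# Hypothetical stand-in for the value returned by doc.get_json_text().
sample = {
    "pages": [
        {"page": 0, "contents": "STATE OF CALIFORNIA"},
        {"page": 1, "contents": "Department of Public Health"},
        {"page": 2, "contents": "California Code of Regulations"},
    ]
}

matches = pages_mentioning(sample, "california")
print(matches)
```

With a real document, the dictionary would come from a single ``get_json_text()`` call rather than one request per page.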
277
283
278
284
.. method:: get_page_text(page)
279
285
@@ -284,6 +290,10 @@ Document
284
290
# Let's print just the first line
285
291
>>> print(txt.split("\n")[0])
286
292
STATE OF CALIFORNIA- HEALTH AND HUMAN SERVICES AGENCY
293
+
294
+
.. method:: get_page_text_url(page)
295
+
296
+
Retrieve the link to the static asset where the page's plaintext is available. If the document is public, the URL will point to S3; otherwise it will point to an internal DocumentCloud URL to verify that the user has permission to view the page.
.. method:: get_page_position_json_url(page)

Submit a page number and receive a link to the static asset where the page's text position information is available in JSON format. If the document is public, the URL will point to S3; otherwise it will point to an internal DocumentCloud URL to verify that the user has permission to view the page.
309
+
296
310
.. attribute:: id
297
311
298
312
The unique identifier of the document in DocumentCloud's system. This is a number.
299
313
``83251``
300
314
315
+
.. attribute:: json_text_url
316
+
317
+
A link to the static resource where the document's full text is available in a custom JSON format, indexed by page.
318
+
301
319
.. attribute:: language
302
320
303
321
The three-character code for the language this document is in.
@@ -330,6 +348,11 @@ Document
330
348
>>> obj.mentions
331
349
[<Mention: Page 2>, <Mention: Page 3> ....
332
350
351
+
.. attribute:: noindex
352
+
353
+
A boolean indicating whether the document is hidden from search engines and DocumentCloud search.
354
+
A document may be public and embedded on a site but still have ``noindex`` set to ``True`` so that search engines do not index it. Private documents are never searchable on search engines, regardless of this setting.
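The rule above can be sketched as a small predicate. This is an illustration only: the helper name ``search_engine_visible`` and the stand-in objects are hypothetical, and the ``"public"`` and ``"private"`` values for ``access`` are assumptions.

```python
from types import SimpleNamespace


def search_engine_visible(doc):
    """True only when the document is public and not flagged noindex."""
    return doc.access == "public" and not doc.noindex


# Hypothetical stand-ins for Document objects.
docs = [
    SimpleNamespace(id=1, access="public", noindex=False),
    SimpleNamespace(id=2, access="public", noindex=True),
    SimpleNamespace(id=3, access="private", noindex=False),
]

visible_ids = [d.id for d in docs if search_engine_visible(d)]
print(visible_ids)
```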
355
+
333
356
.. attribute:: normal_image
334
357
335
358
Returns the binary data for the "normal" sized image of the document's
@@ -356,6 +379,10 @@ Document
356
379
357
380
The ID of the organization that owns this document.
358
381
382
+
.. attribute:: original_extension
383
+
384
+
The original file extension of the document before it was converted into a PDF during DocumentCloud processing.
385
+
359
386
.. attribute:: page_count
360
387
361
388
Alias for :attr:`pages`.
@@ -370,6 +397,24 @@ Document
370
397
371
398
The number of pages in the document.
372
399
400
+
.. attribute:: page_position_json
401
+
402
+
The raw positions of text on the first page, in a custom JSON format. Each unit (word or letter) on the page has coordinates. To get a different page, use ``get_page_position_json(page)``. Consult the API documentation for more details.
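A minimal sketch of filtering text units by where they sit on the page. It assumes each unit is a dictionary with a ``text`` key and ``x1``, ``x2``, ``y1``, ``y2`` coordinates expressed as fractions of the page size. That layout, the helper name ``words_in_region``, and the sample data are assumptions for illustration, so check the API documentation for the actual format.

```python
def words_in_region(positions, top=0.0, bottom=1.0):
    """Return the text of units whose vertical span lies within [top, bottom]."""
    return [
        unit["text"]
        for unit in positions
        if unit["y1"] >= top and unit["y2"] <= bottom
    ]


# Hypothetical stand-in for the value of doc.page_position_json.
sample_positions = [
    {"text": "STATE", "x1": 0.10, "x2": 0.20, "y1": 0.05, "y2": 0.07},
    {"text": "OF", "x1": 0.21, "x2": 0.24, "y1": 0.05, "y2": 0.07},
    {"text": "Signature", "x1": 0.10, "x2": 0.25, "y1": 0.90, "y2": 0.93},
]

# Keep only the units in the top quarter of the page.
header_words = words_in_region(sample_positions, bottom=0.25)
print(header_words)
```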
403
+
404
+
.. attribute:: page_position_json_url
405
+
406
+
A link to the static asset where the first page of page positions in custom JSON format is available. Each unit (word or letter) in the document will have coordinates. To get a link to a different page use
407
+
``get_page_position_json_url(page)``. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
408
+
409
+
.. attribute:: page_text
410
+
411
+
The document's first page in plaintext format. To get a different page use
412
+
``get_page_text(page)``.
413
+
414
+
.. attribute:: page_text_url
415
+
416
+
A link to the static asset where the document's first page in plaintext format is available. To get a different page use ``get_page_text_url(page)``. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
417
+
373
418
.. attribute:: pdf
374
419
375
420
Returns the binary data for the document's original PDF file.
@@ -382,13 +427,22 @@ Document
382
427
383
428
Returns a list of IDs for the projects this document is in.
384
429
430
+
.. attribute:: publish_at
431
+
432
+
A timestamp (date and time) at which this document will automatically be made public.
433
+
385
434
.. attribute:: published_url
386
435
387
436
Returns a URL outside of documentcloud.org where this document has been published.
388
437
389
438
.. attribute:: related_article
390
439
391
440
Returns a URL for a news story related to this document.
441
+
442
+
.. attribute:: revision_control
443
+
444
+
A boolean indicating whether or not this document has revision control enabled.
445
+
Revision control is only available to DocumentCloud premium users.
392
446
393
447
.. attribute:: sections
394
448
@@ -439,11 +493,11 @@ Document
439
493
440
494
Returns a URL containing the "thumbnail" sized image of the document's
441
495
first page. If you would like the URL for some other page, pass the page
442
-
number into ``get_small_thumbnail_url(page)``.
496
+
number into ``get_thumbnail_image_url(page)``.
443
497
444
498
.. attribute:: thumbnail_image_url_list
445
499
446
-
Returns a list of URLs for the "small" sized image of every page in the document.
500
+
Returns a list of URLs for the "thumbnail" sized image of every page in the document.
447
501
448
502
.. attribute:: title
449
503
@@ -463,6 +517,28 @@ Document
463
517
464
518
The ID of the user who owns this document.
465
519
520
+
.. attribute:: writable_fields
521
+
522
+
A quick-reference list of the fields a user may modify.
523
+
Includes ``access``, ``data``, ``description``, ``language``, ``publish_at``, ``published_url``, ``related_article``, ``source``, and ``title``.
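One way to use this list is to screen a set of proposed changes before saving. The sketch below copies the field names from the description above. The helper name ``split_changes`` and the sample values are hypothetical, and with a real document the list would come from ``doc.writable_fields`` instead of a hard-coded constant.

```python
# Field names copied from the writable_fields description above.
WRITABLE_FIELDS = [
    "access", "data", "description", "language", "publish_at",
    "published_url", "related_article", "source", "title",
]


def split_changes(changes, writable=WRITABLE_FIELDS):
    """Separate a dict of proposed changes into writable and rejected parts."""
    allowed = {key: value for key, value in changes.items() if key in writable}
    rejected = sorted(set(changes) - set(allowed))
    return allowed, rejected


allowed, rejected = split_changes(
    {"title": "Budget memo", "page_count": 12, "source": "City clerk"}
)
print(allowed)
print(rejected)
```

Read-only attributes such as ``page_count`` end up in ``rejected``, so they can be reported to the user rather than silently ignored.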
524
+
525
+
.. attribute:: xlarge_image
526
+
527
+
Returns the binary data for the "xlarge" sized image of the document's
528
+
first page. If you would like the data for some other page, pass the page
529
+
number into ``get_xlarge_image(page)``.
530
+
531
+
.. attribute:: xlarge_image_url
532
+
533
+
Returns a URL containing the "xlarge" sized image of the document's
534
+
first page. If you would like the URL for some other page, pass the page
535
+
number into ``get_xlarge_image_url(page)``.
536
+
537
+
.. attribute:: xlarge_image_url_list
538
+
539
+
Returns a list of URLs for the "xlarge" sized image of every page in the document.