Commit ea7c8d6

Merge pull request #33 from MuckRock/doc_cleanup
Added more documentation, fixed xlarge_image reference
2 parents 5a20bde + bd9d155 commit ea7c8d6

2 files changed, with 80 additions and 4 deletions

docs/documents.rst

Lines changed: 79 additions & 3 deletions
@@ -208,6 +208,10 @@ Document
 .. attribute:: contributor_organization
 
 The organizational affiliation of the user who originally uploaded the document.
+
+.. attribute:: contributor_organization_slug
+
+The slug (url friendly identifier) of the organization that the user who originally uploaded the document belongs to.
 
 .. attribute:: created_at
 
@@ -225,7 +229,6 @@ Document
 
 Keys must be strings and only contain alphanumeric characters.
 
-
 .. attribute:: description
 
 A summary of the document. Can be edited and saved with a put command.
@@ -274,6 +277,9 @@ Document
 >>> client.documents.get(new.id).get_errors()
 [{'id': 96136, 'created_at': datetime.datetime(2023, 8, 30, 16, 28, 8, 594859), 'message': '404 Client Error: Not Found for url: https://www.launchcamden.com/wp-content/uploads/2023/08/7.13.23_01002.pdf'}]
 
+.. method:: get_json_text()
+
+Returns the full text of the document, in a custom JSON format, indexed by page. May also be referenced shorthand as ``json_text``. Useful if trying to compare text throughout the document without making an API call to get the text of each page. Consult the full API documentation for more details.
 
 .. method:: get_page_text(page)
 
@@ -284,6 +290,10 @@ Document
 # Let's print just the first line
 >>> print(txt.split("\n")[0])
 STATE OF CALIFORNIA- HEALTH AND HUMAN SERVICES AGENCY
+
+.. method:: get_page_text_url(page)
+
+Retrieve the link to the static asset where the page's plaintext is available. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
 
 .. method:: get_page_position_json(page)
 
@@ -293,11 +303,19 @@ Document
 >>> obj = client.documents.get('1088501-adventuretime-alta')
 >>> json = obj.get_page_position_json(1)
 
+.. method:: get_page_position_json_url(page)
+
+Submit a page number and receive a link to the static asset where page text position information is in JSON format. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
+
 .. attribute:: id
 
 The unique identifer of the document in DocumentCloud's system. This is a number.
 ``83251``
 
+.. attribute:: json_text_url
+
+A link to the static resource where the full text of the document, in a custom JSON format, indexed by page is available.
+
 .. attribute:: language
 
 The three character code for the language this document is in.
@@ -330,6 +348,11 @@ Document
 >>> obj.mentions
 [<Mention: Page 2>, <Mention: Page 3> ....
 
+.. attribute:: noindex
+
+A boolean indicating whether the document is hidden from search engines and DocumentCloud search.
+A document may be public and embedded on a site, but still have noindex set to True so that the document isn't indexed on search engines. Private documents of course are not searchable on search engines regardless.
+
 .. attribute:: normal_image
 
 Returns the binary data for the "normal" sized image of the document's
@@ -356,6 +379,10 @@ Document
 
 The ID for the organization which owns this document
 
+.. attribute:: original_extension
+
+The original file extension of the document before it was converted into a PDF during DocumentCloud processing.
+
 .. attribute:: page_count
 
 Alias for :attr:`pages`.
@@ -370,6 +397,24 @@ Document
 
 The number of pages in the document.
 
+.. attribute:: page_position_json
+
+The raw positions of text on the first page, in a custom JSON format. Consult the API documentation for more details. Each unit (word or letter) in the document will have coordinates. To get a different page use ``get_page_position_json(page)``.
+
+.. attribute:: page_position_json_url
+
+A link to the static asset where the first page of page positions in custom JSON format is available. Each unit (word or letter) in the document will have coordinates. To get a link to a different page use
+``get_page_position_json_url(page)``. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
+
+.. attribute:: page_text
+
+The document's first page in plaintext format. To get a different page use
+``get_page_text(page)``.
+
+.. attribute:: page_text_url
+
+A link to the static asset where the document's first page in plaintext format is available. To get a different page use ``get_page_text_url(page)``. If the document is public, the URL will point to S3, otherwise it will point to an internal DocumentCloud URL to verify that the user has permissions to view the page.
+
 .. attribute:: pdf
 
 Returns the binary data for document's original PDF file.
@@ -382,13 +427,22 @@ Document
 
 Returns a list of IDs for the projects this document is in.
 
+.. attribute:: publish_at
+
+A timestamp (Date Time) when to automatically make this document public in a scheduled manner.
+
 .. attribute:: published_url
 
 Returns an URL outside of documentcloud.org where this document has been published.
 
 .. attribute:: related_article
 
 Returns an URL for a news story related to this document.
+
+.. attribute:: revision_control
+
+A boolean indicating whether or not this document has revision control enabled.
+Revision control is only available to DocumentCloud premium users.
 
 .. attribute:: sections
 
@@ -439,11 +493,11 @@ Document
 
 Returns a URL containing the "thumbnail" sized image of the document's
 first page. If you would like the URL for some other page, pass the page
-number into ``get_small_thumbnail_url(page)``.
+number into ``get_thumbnail_image_url(page)``.
 
 .. attribute:: thumbnail_image_url_list
 
-Returns a list of URLs for the "small" sized image of every page in the document.
+Returns a list of URLs for the "thumbnail" sized image of every page in the document.
 
 .. attribute:: title
 
@@ -463,6 +517,28 @@ Document
 
 The ID for the user which owns this document
 
+.. attribute:: writable_fields
+
+Useful quick reference list for which fields a user may modify.
+Includes `access`, `data`, `description`, `language`, `publish_at`, `published_url`, `related_article`, `source`, and `title`.
+
+.. attribute:: xlarge_image
+
+Returns the binary data for the "xlarge" sized image of the document's
+first page. If you would like the data for some other page, pass the page
+number into ``get_xlarge_image(page)``.
+
+.. attribute:: xlarge_image_url
+
+Returns a URL containing the "xlarge" sized image of the document's
+first page. If you would like the URL for some other page, pass the page
+number into ``get_xlarge_image_url(page)``.
+
+.. attribute:: xlarge_image_url_list
+
+Returns a list of URLs for the "xlarge" sized image of every page in the document.
+
+
 Mentions
 --------
 
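As a rough illustration of how the ``writable_fields`` list documented in this diff might be used, the sketch below splits a dictionary of proposed changes into fields that appear in that list and fields that do not. The field names are copied from the documentation text in this commit. ``split_changes`` is a hypothetical helper written for this example, not part of the library, and no API call is made.

```python
# Field names copied from the writable_fields documentation in this commit.
WRITABLE_FIELDS = [
    "access",
    "data",
    "description",
    "language",
    "publish_at",
    "published_url",
    "related_article",
    "source",
    "title",
]


def split_changes(changes):
    """Hypothetical helper: separate writable keys from non-writable ones.

    Returns a tuple (allowed, rejected), where allowed is a dict of the
    changes whose keys are in WRITABLE_FIELDS and rejected is a sorted
    list of the remaining keys.
    """
    allowed = {key: value for key, value in changes.items() if key in WRITABLE_FIELDS}
    rejected = sorted(key for key in changes if key not in WRITABLE_FIELDS)
    return allowed, rejected


proposed = {"title": "Budget memo", "page_count": 12, "noindex": True}
allowed, rejected = split_changes(proposed)
print(allowed)   # fields that the documentation lists as writable
print(rejected)  # fields that are not in the writable list
```

In real use, a check like this would presumably read the list from the document object's ``writable_fields`` attribute rather than from a hard-coded constant.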
documentcloud/documents.py

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ def __str__(self):
     def __getattr__(self, attr):
         """Generate methods for fetching resources"""
         p_image = re.compile(
-            r"^get_(?P<size>thumbnail|small|normal|large)_image_url(?P<list>_list)?$"
+            r"^get_(?P<size>thumbnail|small|normal|large|xlarge)_image_url(?P<list>_list)?$"
         )
         get = attr.startswith("get_")
         url = attr.endswith("_url")
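The effect of the one-line change above can be checked in isolation. The sketch below copies both versions of the pattern from the diff and compares how they treat ``xlarge`` attribute names. ``parse_image_attr`` is a hypothetical helper written for this example; how the library's ``__getattr__`` actually uses the match is not shown in this commit, so the example only demonstrates the regular expression itself.

```python
import re

# Both patterns are copied verbatim from the diff of documentcloud/documents.py.
OLD_PATTERN = re.compile(
    r"^get_(?P<size>thumbnail|small|normal|large)_image_url(?P<list>_list)?$"
)
NEW_PATTERN = re.compile(
    r"^get_(?P<size>thumbnail|small|normal|large|xlarge)_image_url(?P<list>_list)?$"
)


def parse_image_attr(pattern, attr):
    """Hypothetical helper: return (size, wants_list) if attr matches, else None."""
    match = pattern.match(attr)
    if match is None:
        return None
    # The optional "list" group is None when the "_list" suffix is absent.
    return match.group("size"), match.group("list") is not None


for name in ("get_xlarge_image_url", "get_xlarge_image_url_list", "get_large_image_url"):
    print(name, parse_image_attr(OLD_PATTERN, name), parse_image_attr(NEW_PATTERN, name))
```

With the old pattern the two ``xlarge`` names do not match at all, which is consistent with the commit message's "fixed xlarge_image reference". The existing sizes match under both patterns.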
