Handle Attachments
PDF documents can contain attachments, sometimes also called embedded files.
Retrieve Attachments
Attachments have a name, but it might not be unique. For this reason, the value of reader.attachments["attachment_name"]
is a list.
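The shape of that mapping can be pictured with a plain-Python model. This is only an illustration using the standard library, not pypdf itself; the file names and contents are made up for the example.

```python
from collections import defaultdict

# Hypothetical (name, content) pairs as they might be embedded in one PDF.
# Note that "notes.txt" appears twice.
embedded = [("notes.txt", b"first"), ("data.csv", b"a,b"), ("notes.txt", b"second")]

# Group contents by name: every name maps to a list, even if it occurs once.
by_name = defaultdict(list)
for name, content in embedded:
    by_name[name].append(content)

print(by_name["notes.txt"])  # [b'first', b'second']
```

Code that reads a single name should therefore be prepared to handle more than one entry.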
You can extract all attachments like this:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for name, content_list in reader.attachments.items():
    for i, content in enumerate(content_list):
        with open(f"out-attachment-{i}-{name}", "wb") as fp:
            fp.write(content)
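Attachment names come from the PDF file itself, so they may contain path separators or be empty. The following is a minimal sketch of one way to build a safer output path before writing; the helper name safe_attachment_path and the fallback name "attachment" are assumptions made for this example and are not part of pypdf.

```python
from pathlib import Path


def safe_attachment_path(out_dir: Path, index: int, name: str) -> Path:
    """Hypothetical helper: build an output path that stays inside out_dir."""
    # Treat both "/" and "\" as separators and keep only the last component.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Fall back to a generic name if nothing usable remains.
    if base in ("", ".", ".."):
        base = "attachment"
    return out_dir / f"out-attachment-{index}-{base}"


print(safe_attachment_path(Path("out"), 0, "../../etc/passwd"))
```

The returned path could then be passed to open() in place of the f-string used in the loop above.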
Alternatively, you can retrieve them in an object-oriented fashion if you need further details for these files:
from pypdf import PdfReader
reader = PdfReader("example.pdf")
for attachment in reader.attachment_list:
    print(attachment.name, attachment.alternative_name, attachment.content)
Add Attachments
To add a new attachment, use the following code:
from pypdf import PdfWriter
writer = PdfWriter(clone_from="example.pdf")
writer.add_attachment(filename="test.txt", data=b"Hello World!")
As you can see, the basic attachment properties are its name and content. To modify further properties, use the setters provided by the returned object:
import datetime
import hashlib
from pypdf import PdfWriter
from pypdf.generic import create_string_object, ByteStringObject, NameObject, NumberObject
writer = PdfWriter(clone_from="example.pdf")
embedded_file = writer.add_attachment(filename="test.txt", data=b"Hello World!")
embedded_file.size = NumberObject(len(b"Hello World!"))
embedded_file.alternative_name = create_string_object("test1.txt")
embedded_file.description = create_string_object("My test file")
embedded_file.subtype = NameObject("/text/plain")
embedded_file.checksum = ByteStringObject(hashlib.md5(b"Hello World!").digest())
embedded_file.modification_date = datetime.datetime.now(tz=datetime.timezone.utc)
# embedded_file.content = "My new content."
writer.write("out-add-attachment.pdf")
The same functionality is available if you iterate over the attachments of a writer
using writer.attachment_list.
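Since the size and checksum describe the attachment content, it can help to derive both from the same bytes object rather than repeating the literal. The sketch below uses only the standard library; PayloadMetadata and payload_metadata are hypothetical names introduced for this example and are not part of pypdf.

```python
import hashlib
from typing import NamedTuple


class PayloadMetadata(NamedTuple):
    """Hypothetical container for values derived from the attachment bytes."""

    size: int
    md5_digest: bytes


def payload_metadata(data: bytes) -> PayloadMetadata:
    # Compute both values from the same bytes so they cannot drift apart.
    return PayloadMetadata(size=len(data), md5_digest=hashlib.md5(data).digest())


meta = payload_metadata(b"Hello World!")
print(meta.size, meta.md5_digest.hex())
```

The resulting values would then be wrapped in NumberObject and ByteStringObject before being assigned to the size and checksum setters, as in the example above.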
Delete Attachments
To delete an existing attachment, use the following code:
from pypdf import PdfWriter
writer = PdfWriter(clone_from="example.pdf")
attachment = writer.add_attachment(filename="test.txt", data=b"Hello World!")
attachment.delete()
assert list(writer.attachment_list) == []
Please note that this will not delete the associated file relationship if one exists. Deleting it as well would require us to know where the relationship has been defined, which adds considerable complexity. For now, please look for the corresponding definition yourself and remove the attachment's reference from the respective /AF array.
PDF/A Compliance
The following example shows how to add an attachment to a PDF/A-3B compliant document without breaking compliance:
from pypdf import PdfWriter
from pypdf.constants import AFRelationship
from pypdf.generic import create_string_object, ArrayObject, NameObject
writer = PdfWriter(clone_from="example.pdf")
attachment = writer.add_attachment(filename="test.txt", data="Hello World!")
attachment.subtype = NameObject("/text/plain")
attachment.associated_file_relationship = NameObject(AFRelationship.SUPPLEMENT)
attachment.alternative_name = create_string_object(attachment.name)
if "/AF" in writer.root_object:
    af = writer.root_object["/AF"].get_object()
else:
    af = ArrayObject()
    writer.root_object[NameObject("/AF")] = af
af.append(attachment.pdf_object.indirect_reference)
writer.write("out-a3b.pdf")
This example declares a relationship between the attachment and the whole document. Alternatively, the relationship can be added to most other PDF objects as well. For details, see the corresponding PDF specification, for example section 14.13 of the PDF 2.0 specification.