Call the Plumber: Your Documents are Leaking

For most organizations, posting brochures, contract templates, whitepapers, and various forms of marketing collateral online is standard practice. For cyber criminals, that same material can quietly provide a wealth of information about the organization they are targeting.

In this blog post, we will examine why cyber criminals benefit from the public sharing of organizational documents, how they make use of the metadata contained in the documents, how misconfigurations and lack of user awareness can lead to data leaks, and things your organization can do to protect itself.

Document Metadata

The most common document types used in a business setting (docx, pdf, xlsx, pptx, etc.) contain a wealth of metadata within the file’s properties. This information is used to automatically track the file’s authors, versions, save dates, and a litany of other properties associated with the file.
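To make this concrete, here is a minimal sketch of reading a few of these properties from an Office Open XML file (docx, xlsx, pptx) using only the Python standard library. It assumes the file stores its core properties in the package part docProps/core.xml; the function name read_core_properties and the labels in FIELDS are illustrative choices, not part of any tool mentioned in this post.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative labels mapped to the elements that hold each value.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_properties(source):
    """Return the core properties found in an OOXML file (path or file-like object)."""
    with zipfile.ZipFile(source) as package:
        try:
            raw = package.read("docProps/core.xml")
        except KeyError:
            # The package has no core properties part at all.
            return {}
    root = ET.fromstring(raw)
    props = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            props[label] = node.text
    return props
```

Calling read_core_properties on a document your organization has already published shows the values any outside party could read from the same file.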

Some of the properties for Office files (.doc, .docx, .pptx, etc.) include:

Some of the properties for PDFs include:

Attacker Advantages

From the adversary’s perspective, much can be gained from analyzing the metadata and content of documents that have been indexed by search engines. With tools like PyMeta and PowerMeta, attackers can scrape the available metadata for a targeted organization. The primary points of interest for attackers are information about the author and the software (including specific versions) used to create the document. The author information can reveal either employee names or the usernames associated with the accounts that created the documents, exposing internal username conventions. Knowledge of these conventions can be used to enumerate additional usernames for use in credential stuffing attacks. Knowledge of the software suites and packages in use internally helps an attacker build an effective payload. If the payload being delivered to the target(s) can be verified as compatible with the specific versions of the software in use internally (e.g., versions of Microsoft Word, Adobe suites, etc.), the chance of successful payload execution increases.
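As a rough illustration of how little it takes to recognize a username convention, the sketch below compares an author value found in metadata against a handful of common patterns built from a known employee name. The pattern list and the function names are assumptions made for this example; a security team could use something similar to gauge what its own published files reveal.

```python
def candidate_usernames(first, last):
    """Build a few common username patterns from a first and last name."""
    first, last = first.lower(), last.lower()
    return {
        "first.last": f"{first}.{last}",
        "flast": f"{first[0]}{last}",
        "firstl": f"{first}{last[0]}",
        "lastf": f"{last}{first[0]}",
    }


def infer_convention(full_name, observed_username):
    """Return the name of the pattern that matches observed_username, or None."""
    parts = full_name.split()
    if len(parts) < 2:
        return None
    first, last = parts[0], parts[-1]
    observed = observed_username.strip().lower()
    for pattern, value in candidate_usernames(first, last).items():
        if value == observed:
            return pattern
    return None
```

For instance, a document whose author field reads jsmith, published by an organization that lists a Jane Smith on its website, matches the flast pattern in this sketch.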

In addition to the metadata, the documents themselves can reveal a litany of valuable internal information. In the course of research-based scanning of companies, Foretrace has uncovered no shortage of organizations (particularly in the financial and healthcare verticals) that have accidentally exposed documents, often due to misconfigurations in file storage solutions. These documents contain internal network login instructions, policies, audit results, marketing plans, and financial forecasts. In many cases, documents that had already been tagged by internal software as “confidential” still made their way online (as shown in the screenshot of a Foretrace console below). The exposure of these documents takes a lot off an attacker’s plate, in some cases providing step-by-step instructions for accessing internal assets from the internet. Combined with metadata that exposes Active Directory naming conventions, this dramatically simplifies the path to a successful intrusion.


Releasing documents on public-facing sites may be required for web services to function and for important information to reach customers. When documents are going to be released on public-facing websites, the metadata should be stripped from them first. This can be done in the properties menu of most common office document types. While this will not strip every piece of extractable metadata, it will strip the majority, including the fields visible in the properties menu that may reveal internal information and conventions. The procedure for posting an internally produced document for public consumption should therefore include a step to strip the metadata before publication (at minimum, by creating a copy with the properties and personal information removed via the document’s Properties menu). This helps ensure that only the intended content of the document is released to the public.
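For teams that want to script this step rather than rely on the Properties menu, the following is a minimal sketch, not a complete sanitizer. It writes a copy of an Office Open XML file in which the core properties part (docProps/core.xml) is replaced with an empty one. It assumes the consuming application accepts an empty core properties part, and it deliberately leaves everything else untouched, including docProps/app.xml (application name and version), comments, and tracked changes. The function name strip_core_properties is illustrative.

```python
import zipfile

# A core properties part with no author, editor, or date fields.
EMPTY_CORE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/'
    'package/2006/metadata/core-properties"/>'
)


def strip_core_properties(source, destination):
    """Copy an OOXML file, replacing docProps/core.xml with an empty part.

    source and destination may be file paths or binary file-like objects.
    The original file is never modified.
    """
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                data = EMPTY_CORE.encode("utf-8")
            dst.writestr(item.filename, data)
```

Open the resulting copy and inspect its properties before publishing, since this sketch only covers the fields stored in the core properties part.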

While document publication is usually intentional, it is not uncommon for documents to be indexed by search engines due to misconfigurations in web server software and/or file sharing and storage services. A Cloud Access Security Broker (CASB) solution, or simply dedicated policy and configuration reviews on a recurring basis, can help prevent the accidental release or disclosure of internal documents.


For a majority of large organizations, no formal security review of publicly released documents has ever been part of the publication lifecycle. Additionally, numerous siloed departments may release documents to different sites for different purposes and at different cadences (as needed, monthly, daily). In addition to modifying existing procedures and introducing metadata removal into the publication lifecycle, performing targeted searches on popular search engines (Google, Bing), popularly called Dorking, can shed light on an organization’s existing exposure. Internal security teams can perform these searches manually on a periodic basis. An inventory of the files posted to the public site, and the metadata within those files, can be easily collected via open source tools like PyMeta and PowerMeta. These documents can be manually reviewed to confirm that they were intentionally released. Teams can also run dedicated searches for confidential content that was NOT released on the organization’s sites but was exposed elsewhere, whether accidentally through a misconfiguration by the organization, a vendor, or a third party, or intentionally by a cyber criminal group.

Example Searches to Detect Data Leaks or Accidental Disclosure

Find exposed documents on a specific site (example.com is a placeholder for the domain being checked):

site:example.com filetype:pdf intext:confidential

site:example.com filetype:doc intext:proprietary

An alternative structure, if a specific marking is used to indicate confidential documents, or specific confidential project names are used to mark nonpublic initiatives (for example, a fictitious merger called Project Durango):

intitle:"Example Org LLC" intext:"Project Durango"
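Security teams that run these searches periodically may find it easier to generate the query strings than to retype them. The sketch below composes queries from the site, filetype, intitle, and intext search operators; the helper names, the example.com domain, and the sample markings are placeholders chosen for illustration. It only builds strings and does not submit anything to a search engine.

```python
def _term(operator, value):
    """Format one operator:value pair, quoting values that contain spaces."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{operator}:{value}"


def build_dork(site=None, filetype=None, intitle=None, intext=None):
    """Compose a single search query from whichever operators are provided."""
    pairs = [
        ("site", site),
        ("filetype", filetype),
        ("intitle", intitle),
        ("intext", intext),
    ]
    return " ".join(_term(op, val) for op, val in pairs if val)


def build_dorks(site, filetypes, markings):
    """Build one query per combination of file type and confidentiality marking."""
    return [
        build_dork(site=site, filetype=filetype, intext=marking)
        for filetype in filetypes
        for marking in markings
    ]


if __name__ == "__main__":
    for query in build_dorks("example.com", ["pdf", "doc"], ["confidential", "proprietary"]):
        print(query)
```

The resulting list can be pasted into a search engine one query at a time as part of a recurring review.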

Examples of dorks to search for a variety of misconfigurations, outside of just document exposure or data leaks, can be found on Exploit-DB.

Examples of misconfigurations exposing confidential files are plentiful, generally create serious security incidents, and are relatively embarrassing due to their very public nature (see recently: Digital Ocean). In the course of our regular scans, Foretrace detects internal and confidential documents very frequently, even at organizations with the most mature security operations departments, primarily due to the distance between the teams publishing documents and the information security organization.

For perspective, when performing broad document searches (pdf, docx, xlsx, pptx, txt) against 3 major US banks, over 284,436 document results are returned, with 138 of them containing the term ‘confidential’ (a few sanitized examples in the Foretrace console are included in the screenshot below). A similar scale of results can be found when executing searches against mid-size and large organizations in most verticals.

The need to detect this type of exposure is growing, as attackers increasingly rely on holding data, rather than systems, for ransom (see Canada Post, Sturdy Memorial Hospital, etc.). In some cases, internal compromise wasn’t necessary at all, because the organization had already accidentally published data the attackers deemed worthy of holding for ransom. Unless internal procedures mature to address document data leaks, the modern cyber criminal may need nothing more than a single Google search to fully execute an attack.

If your organization requires more than what is detailed in this complimentary guide, explore Foretrace’s highly curated menu of services designed to safeguard critical data.

Contributors: Nick Ascoli, Jenna Small
