GitHub Project: Multi-Agent PDF Whisperer Using LangChain and LangGraph
Overview

TikTok employs a custom virtual machine (VM) as a security and obfuscation layer for its website and applications. A GitHub project named "TikTok VM Reverse Engineering" aims to reverse engineer this VM to help researchers and developers understand its internal workings. The project provides tools for deobfuscation and debugging, focusing on the webmssdk.js file, a pivotal component of TikTok's security infrastructure.

Deobfuscation

Array Index Obfuscation: The webmssdk.js file relies heavily on array indexing (e.g., Gb[index]), where the indices are actually encoded strings. Once these strings are decoded, regular expressions can replace every such index with readable dot notation, significantly improving code readability.

Function Call Obfuscation: Functions are defined in an array Ab and are called via Ab[index](args), making function calls hard to trace. By parsing the script’s abstract syntax tree (AST) and using AI assistance, researchers converted these calls into a more understandable format.

Bytecode Decryption: The VM’s bytecode is stored as a long encoded string, encrypted through XOR operations with a key. Deobfuscation tools can decrypt the bytecode and extract strings, functions, and metadata from it using atob (Base64 decoding) and LEB128 decoding.

Virtual Machine Decompilation

TikTok’s VM supports features like scoping, nested functions, and exception handling, reflecting its sophistication. Researchers manually parsed each conditional branch and used AI to complete the remaining parts, successfully decompiling the VM’s bytecode. While the decompiled code is not entirely readable, it is sufficient to understand the logic of each function.
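The bytecode decryption step described above can be illustrated with a minimal Python sketch. It is not the project's code: the repeating-key XOR scheme, the order of operations (Base64 first, then XOR), the length-prefixed string layout, and all function names here are assumptions made for illustration. Python's base64.b64decode plays the role of JavaScript's atob.

```python
import base64
from typing import Tuple


def xor_decrypt(data: bytes, key: bytes) -> bytes:
    """XOR every byte with a repeating key (assumed scheme, for illustration)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 integer at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # the low 7 bits carry the payload
        if byte & 0x80 == 0:             # a clear high bit marks the last byte
            return value, pos
        shift += 7


def read_string(buf: bytes, pos: int) -> Tuple[str, int]:
    """Read a LEB128 length prefix, then that many UTF-8 bytes (hypothetical layout)."""
    length, pos = read_uleb128(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode_bytecode(encoded: str, key: bytes) -> bytes:
    """Base64-decode the string, then undo the XOR (assumed order)."""
    return xor_decrypt(base64.b64decode(encoded), key)


if __name__ == "__main__":
    # Build a toy payload so the round trip can be checked end to end.
    key = b"\x5a\x13"
    plain = bytes([5]) + b"hello"  # length prefix 5, then the string
    encoded = base64.b64encode(xor_decrypt(plain, key)).decode("ascii")
    text, _ = read_string(decode_bytecode(encoded, key), 0)
    print(text)
```

Because XOR is its own inverse, the same xor_decrypt function both builds the toy payload and decodes it; a real key and layout would have to be recovered from webmssdk.js itself.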
Debugging

Since webmssdk.js is a JavaScript file executed in a web environment, developers can use the Tampermonkey browser extension to replace the original webmssdk.js file with the deobfuscated version, which makes testing and debugging easier during the reverse engineering process.

Request Handling

TikTok’s server requests include three additional headers: msToken, X-Bogus, and _signature. The latter two are dynamically generated by webmssdk.js:

- VM86: This function handles the initial call for each request.
- VM113: This function generates X-Bogus.
- VM189: This function generates _signature.

Unauthenticated Requests

For unauthenticated requests (e.g., querying user information), only X-Bogus needs to be generated, which can be done using the window.frontierSign function. msToken can be any arbitrary value.

Authenticated Requests

For authenticated requests (e.g., posting comments), _signature must also be generated. Researchers developed a signer tool to create URL signatures and used the playwright library to automate browser instances and validate the comment posting process.

Additional Information

The VM includes various methods to prevent automated operations, such as mouse tracking (VM120) and environment checks (VM265). These are client-side checks that do not communicate with the server, so they can be ignored during signature generation.

Industry Evaluation and Company Background

The project’s outcomes are significant for understanding TikTok’s security mechanisms and advancing the field of reverse engineering. As one of the largest short-video platforms globally, TikTok’s robust security measures highlight the company’s substantial investment in data protection and technical defense. The project’s detailed technical insights aid developers in comprehending the platform’s workings.
However, this deep technical analysis and reverse engineering also spark broader discussions on data privacy and cybersecurity, emphasizing the need for companies to design more secure systems. Industry experts view this project as a testament to the advancements in reverse engineering and the increasing complexity of security measures in technology.

RAGent: A New Multi-Agent PDF Assistant

RAGent is a PDF assistant developed using LangChain and LangGraph that combines retrieval-augmented generation (RAG) with agent-based AI. RAG typically involves chunking documents, storing them in vector databases, and retrieving relevant chunks based on user queries to generate responses. RAGent breaks this process down into three specialized tasks: retrieval, augmentation, and generation. The workflow begins with a user query and ends with a detailed response generated by a large language model (LLM).

Retrieval Agent

The retrieval agent starts by using the extract_text_from_pdf function to read and extract text from PDFs. The text is then cleaned and chunked using the text_to_docs function, with each chunk containing 4000 characters and a 200-character overlap. These chunks are converted into Document objects and stored in a vector database created by create_vectordb, which uses Facebook AI Similarity Search (FAISS). The retrieve_from_pdf function performs similarity searches based on user queries and returns the most relevant chunk, complete with text content and page numbers.

Augmentation Agent

The augmentation agent uses the augment_with_context function to add context to the retrieved text. If the retrieval yields relevant content and page numbers, this function appends the page numbers to the result. If the content is deemed irrelevant or lacks page numbers, it returns the original content with a note indicating the lack of specific page numbers.
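The chunking and augmentation steps described above can be sketched in plain Python. This is a simplified stand-in, not RAGent's code: chunk_text is a hypothetical helper that uses fixed character windows (the real text_to_docs may split differently and also produces Document objects), and the signature and output wording of augment_with_context are assumptions.

```python
from typing import List, Optional


def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 200) -> List[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def augment_with_context(content: str, pages: Optional[List[int]]) -> str:
    """Append page numbers when available, otherwise a note (assumed signature)."""
    if content and pages:
        page_list = ", ".join(str(p) for p in sorted(set(pages)))
        return f"{content}\n\n(Source pages: {page_list})"
    return f"{content}\n\n(No specific page numbers available.)"


if __name__ == "__main__":
    pieces = chunk_text("x" * 10000)
    print([len(p) for p in pieces])
    print(augment_with_context("A primary key uniquely identifies a row.", [12, 12, 13]))
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a chunk boundary is still retrievable in one piece.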
Generation Agent

The generation agent uses a GENERATE_PROMPT and depends on the results from retrieval and augmentation to generate the final response. This response focuses on database management systems (DBMS) and SQL content. Depending on the user’s query type (an explanation or a simplification), the agent decides whether to include page numbers. For queries unrelated to DBMS, the response notes that page numbers are not applicable.

Workflow Construction

To integrate these three agents, the development team employed LangGraph’s StateGraph feature. Each agent is defined as a node in the workflow, with the retrieval agent as the starting point. Conditional edges in the StateGraph determine whether to bypass the augmentation agent and proceed directly to the generation agent, based on whether retrieval returned relevant results. Streamlit was used to create an interactive interface that connects the frontend with the backend workflow.

Core Functionality and Highlights

RAGent’s primary strength lies in its efficient handling of large PDF documents, generating accurate and context-rich responses. One of its standout features is the ability to provide page numbers when appropriate, which lets users quickly locate the relevant passage in the original text. Additionally, the conditional logic ensures that the system only performs augmentation when necessary, optimizing response speed and accuracy.

Industry Evaluation and Company Background

Experts in the tech industry have praised RAGent, seeing it as a promising step towards the future of multi-agent AI systems. The tool’s potential applications in education and research are particularly highlighted. LangGraph, a key component of RAGent, is an open-source library from LangChain, the company behind the LangChain framework, for building stateful, graph-based agent workflows; it has gained recognition for simplifying complex AI workflows.
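The conditional routing described under Workflow Construction can be sketched without any dependencies. The code below is plain Python that mimics the idea of nodes and a conditional edge; it deliberately does not use LangGraph's API, and the node names, state keys, and stub agents are all hypothetical.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Node = Callable[[State], State]

END = "end"


def route_after_retrieval(state: State) -> str:
    """Conditional edge: skip augmentation when retrieval found nothing relevant."""
    return "augment" if state.get("content") and state.get("pages") else "generate"


def run_workflow(nodes: Dict[str, Node], state: State) -> State:
    """Walk the graph from the retrieval node until END is reached."""
    # Static edges; the edge leaving "retrieve" is decided at run time.
    edges = {"augment": "generate", "generate": END}
    trace = []
    current = "retrieve"
    while current != END:
        state = {**state, **nodes[current](state)}  # merge the node's partial update
        trace.append(current)
        if current == "retrieve":
            current = route_after_retrieval(state)
        else:
            current = edges[current]
    return {**state, "trace": trace}


def demo_nodes(hit: bool) -> Dict[str, Node]:
    """Stub agents standing in for the real retrieval, augmentation and generation steps."""
    found = {"content": "A primary key uniquely identifies a row.", "pages": [12]}
    missing = {"content": "", "pages": []}
    return {
        "retrieve": lambda s: found if hit else missing,
        "augment": lambda s: {"content": f"{s['content']} (page {s['pages'][0]})"},
        "generate": lambda s: {"answer": s["content"] or "No relevant passage found."},
    }


if __name__ == "__main__":
    print(run_workflow(demo_nodes(True), {"query": "What is a primary key?"})["trace"])
    print(run_workflow(demo_nodes(False), {"query": "What is the weather?"})["trace"])
```

In a StateGraph the same decision would be expressed as a routing function attached to the retrieval node, with each agent registered as a node; the sketch only shows the control flow that such a graph encodes.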
