Recommendations for developers
As developers integrate LLMs into services, especially when lesser-known open-source LLMs are used, implementing a custom guardrail pipeline as a layer on top of the integrated LLM is a necessity. Open-source libraries and frameworks exist for this purpose; for example, two of the most widely starred guardrail libraries on GitHub are NVIDIA NeMo Guardrails and Guardrails AI. Both libraries let users influence the output of a language model through a schema interface in which they establish rules and criteria for what the model may produce. The libraries also offer functions that constrain the model's output to specific topics defined by the schema, and users can pre-define conversation paths and styles tailored to particular domains or use cases, as sketched below.
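As an illustration of the schema-style topic control described above, the following sketch uses NeMo Guardrails' Colang dialect to pre-define a conversation path that refuses an off-topic request. The medical-advice topic, example utterances, and OpenAI backing model are illustrative assumptions, not configurations evaluated in this work.

```python
# Hedged sketch: topic rails with NeMo Guardrails (nemoguardrails package).
# The topic, utterances, and backing model are assumptions made for illustration.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

colang_content = """
define user ask medical advice
  "What medication should I take for this?"
  "Can you diagnose my symptoms?"

define bot refuse medical advice
  "I'm sorry, I can't provide medical advice. Please consult a professional."

define flow medical advice
  user ask medical advice
  bot refuse medical advice
"""

# Build the rails configuration from the inline schema and wrap the model with it.
config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rails = LLMRails(config)

# Requests matching the defined flow are answered with the pre-defined refusal.
reply = rails.generate(messages=[{"role": "user", "content": "Can you diagnose my symptoms?"}])
print(reply["content"])
```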
While the existing open-source guardrail libraries help ensure safe integration, they have limited interoperability with external open-source models and are not yet effective as standalone guardrails independent of a protected OpenAI GPT model. When integrating a non-OpenAI model, the available alternatives as of March 2024 appear to be (1) thoroughly inspecting the open-source model's model card to understand the presence or absence of guardrails; (2) conducting red-team testing; and (3) developing a custom guardrail implementation that combines block lists, output classifiers, and sentiment analyzers to filter problematic outputs and steer the model in subsequent inferences, as in the sketch below.
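A minimal sketch of alternative (3) follows: a post-hoc output filter that combines a block list with an off-the-shelf output classifier. The block-list terms, the classifier (unitary/toxic-bert from the Hugging Face Hub), and the score threshold are illustrative assumptions rather than components prescribed by this study.

```python
# Hedged sketch: a custom output-filtering guardrail combining a block list with
# an output classifier. The terms, model, and threshold below are illustrative.
from transformers import pipeline

BLOCKLIST = {"social security number", "credit card number", "home address"}

# Any output classifier can be substituted here; unitary/toxic-bert is a publicly
# available toxicity model whose top label is checked against a score threshold.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def filter_output(text: str, threshold: float = 0.5) -> str:
    """Return the model output unchanged, or a withheld message if it is flagged."""
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[withheld: response matched a block-listed term]"
    result = classifier(text, truncation=True)[0]
    if result["label"] == "toxic" and result["score"] >= threshold:
        return "[withheld: response flagged by the output classifier]"
    return text

print(filter_output("Here is a perfectly benign model response."))
```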
We also advise LLM adopters to configure their production models with the lowest temperature that still achieves the target task's output, as our experimental results show that privacy leakage rises with temperature. Because conversation histories stored within LLM services such as ChatGPT may be incorporated into continued training, end-users interacting directly with an LLM or LLM-based service should be cautious about what information they type into such platforms, with the awareness that their content may be leaked in future outputs outside the privacy of their user account.
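For concreteness, the sketch below pins a low sampling temperature at inference time using the OpenAI Python client; the model name and prompt are placeholders, and the same parameter is exposed by most open-source serving stacks.

```python
# Hedged sketch: requesting near-deterministic decoding in production.
# Model name and prompt are placeholders; the temperature parameter is the
# setting this recommendation concerns.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # placeholder model
    temperature=0.0,         # lowest temperature that still meets the task's needs
    messages=[{"role": "user", "content": "Summarize the following support ticket: ..."}],
)
print(response.choices[0].message.content)
```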