niklas muhs




4 days


interaction conceptualization, implementation


exploration of an integrated workflow for automatic prompt engineering based on dspy and extended through an interface.


when developing applications on top of large language models (llms) in an exploitative setting, where the goal is to limit uncertain outcomes, it can be difficult to determine when a prompt is accurate and adaptable enough for deployment.

in addition, when evaluating prompts at scale, it becomes harder to identify the remaining issues and find systematic ways to address them.


conventionally, we write prompts hoping that the model will infer the desired outcomes from the inputs. this process is opaque because we are uncertain which affordances of the llm are actually relevant.


however, by treating the llm as the black box that it is, the llm could instruct itself using its own affordances, while the user focuses on inputs and desired outputs. frameworks such as dspy already allow for this kind of optimization.

llms can generate and test a variety of prompts to align with the desired outcomes.
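this generate-and-test idea can be sketched as a simple loop: score each candidate prompt against a set of evaluated examples and keep the best one. everything here is illustrative — `call_llm` is a hypothetical stand-in for a real model call, and the candidate prompts are made up.

```python
def call_llm(prompt: str, input_text: str) -> str:
    # hypothetical stand-in: a real implementation would query a model api
    return input_text.upper()

def evaluate(prompt: str, examples: list[tuple[str, str]]) -> float:
    """fraction of examples where the prompt yields the desired output."""
    hits = sum(call_llm(prompt, x) == y for x, y in examples)
    return hits / len(examples)

def pick_best(candidates: list[str], examples: list[tuple[str, str]]) -> str:
    """score every candidate prompt and keep the highest-scoring one."""
    return max(candidates, key=lambda p: evaluate(p, examples))

examples = [("hello", "HELLO"), ("world", "WORLD")]
best = pick_best(["shout the input", "repeat the input"], examples)
```

in a real system the scoring metric would be richer than exact string match, but the shape of the loop stays the same.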


in addition, llms can help create new examples using the optimized prompt and generate edge-case inputs that the user may not have considered, enabling iterative improvement.
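one way to picture this example-growing step, as a minimal sketch: propose unseen inputs, have the user label them, and append them to the example set. `generate_edge_cases` is a hypothetical stand-in for an llm call that would propose inputs unlike the ones seen so far.

```python
def generate_edge_cases(prompt: str, known_inputs: list[str]) -> list[str]:
    # hypothetical stand-in: a real implementation would ask the llm for
    # inputs that differ from the examples collected so far
    return ["", "input with\nnewlines", "non-ascii input: café"]

def extend_examples(prompt, examples, label):
    """generate unseen inputs, let the user label them, and append them."""
    new_inputs = generate_edge_cases(prompt, [x for x, _ in examples])
    return examples + [(x, label(x)) for x in new_inputs]

examples = [("hello", "HELLO")]
grown = extend_examples("shout the input", examples, label=str.upper)
```

keeping the user in the labeling step matters: the examples are only useful for evaluation if their desired outputs are trusted.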


a large set of accurately evaluated examples ensures both adaptability to new inputs and accuracy in achieving the desired outputs.


try the prototype here:

the workflow offers other advantages as well:

when prompt performance plateaus, we can move on to other methods. this makes the transition from prompt engineering to, for example, finetuning clearer and easier, especially since the evaluated examples can double as finetuning data.
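reusing the evaluated examples as finetuning data can be as simple as serializing them into a training file. the sketch below uses the common chat-message jsonl layout; the exact schema depends on the provider, so treat the field names as an assumption.

```python
import json

def to_finetune_jsonl(examples: list[tuple[str, str]]) -> str:
    """serialize evaluated examples as chat-style jsonl records."""
    lines = []
    for inp, out in examples:
        record = {"messages": [
            {"role": "user", "content": inp},
            {"role": "assistant", "content": out},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

dataset = to_finetune_jsonl([("hello", "HELLO"), ("world", "WORLD")])
```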

a database of examples also makes it easy to compare against smaller, more efficient models whose affordances are harder to discover but whose capabilities may be similar.
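with a shared example database, such a comparison reduces to running every candidate model over the same set and comparing scores. the two model callables below are hypothetical stand-ins for a large and a smaller, cheaper model.

```python
def accuracy(model, examples):
    """share of examples a model gets exactly right."""
    return sum(model(x) == y for x, y in examples) / len(examples)

# hypothetical stand-ins for a large and a smaller, cheaper model
large_model = lambda x: x.upper()
small_model = lambda x: x.upper() if len(x) < 6 else x

examples = [("hi", "HI"), ("prompting", "PROMPTING")]
scores = {name: accuracy(m, examples)
          for name, m in [("large", large_model), ("small", small_model)]}
```

if the smaller model's score is close enough, it can replace the larger one for that task at lower cost.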

finally, linking this workflow to traces in the production system could lead to continuous product improvement.
