Prompt Engineering: What’s the Catch?

Sean Wu
Nov 15, 2023

A well-tuned large language model (LLM) like ChatGPT can be thought of as a runtime or interpreter, capable of ‘executing’ natural language instructions to solve a wide range of tasks, often with surprising effectiveness. This so-called prompt engineering requires almost no learning, so what’s the catch? Can this method, where we express instructions in natural language and have them ‘executed’ on an instruction-tuned LLM runtime, be sufficient for building large and dependable systems? First, let’s ask:

Why is prompt engineering attracting our attention?

Why is prompt engineering exciting? From the developer’s perspective, two aspects stand out:

  1. No learning curve: use everyday, plain language. Using plain language instead of a programming language has many advantages: you do not need to learn any new syntax or semantics, and you do not have to fiddle with compilers, interpreters, runtimes, and libraries; it just works.
  2. Declarative problem solving: no need to understand how. In software engineering, for the most part, you need to understand how to solve a particular problem in terms of step-by-step instructions. But the solution to some problems is just hard to express this way, such as writing a statement to dispute a traffic ticket. Being able to just express what you want and get the result is a great developer experience.

Prompt engineering significantly lowers the barrier for non-technical users to build useful software and solve real-world problems. But can we use plain language to declaratively solve all problems? Let’s look at this from both the language and the interpreter perspectives.

Is plain language enough, for all domains?

Have you ever read lengthy legal documents where you understand every word, but they don’t make sense as a whole? Then you know plain language is never enough. One can safely argue that these legal documents are written in a different, domain-specific language (DSL). Such domain-specific languages are not limited to law; they are prevalent across various professions: doctors, pharmacists, and more. Programmers, well, have invented numerous programming languages.

Domain-specific problems need domain-specific languages

Because we, as humans, cannot live forever, and the complexity of the world and of the tasks we want to undertake is practically unbounded, a division of knowledge and skills naturally emerges in society. To efficiently address these complexities, individuals specialize in specific areas. This specialization leads to the development of expertise in particular fields, creating a division of knowledge. Learning and perfecting these specializations requires more and more time, so it only makes economic sense that few people know how to make a match while everyone knows how to use one.

With the division of knowledge, domain-specific languages become necessary. Plain English mostly consists of words and phrases that are widely understood by native speakers and do not require specialized knowledge or terminology. Lacking the specific terminology and constructs of a domain-specific language, plain language is not efficient for expressing either the problem or the solution in many real-world scenarios, as one might need a much longer text to explain a single term.

Semantics needs to be predefined

Plain language is designed for daily communication between individuals who may never have met before and who were often raised in different school systems. It does not have well-defined semantics. Different individuals may express the same idea differently, and the same statement can be understood differently, so it is always subject to interpretation.

There is a major downside to lacking predefined semantics: one always needs to verify whether the recipient actually interprets the sentence as intended. That’s why even the most experienced prompt engineers need to find a working prompt through trial and error. This is also why so-called guardrails are among the first things you need when building LLM applications.
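To make the verification step concrete, here is a minimal sketch of a guardrail in Python. The helper `make_fake_llm` is a hypothetical stand-in for a real model call, and the sentiment task and its label set are assumptions made purely for illustration. The point is only that the caller checks each reply against a predefined contract instead of trusting it.

```python
from typing import Callable

# Hypothetical label set for a sentiment task; not taken from any real API.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def guarded_classify(
    llm: Callable[[str], str], text: str, max_attempts: int = 3
) -> str:
    """Ask the model for a label and accept it only if it is in the contract."""
    prompt = (
        "Classify the sentiment of the text as positive, negative, or neutral. "
        f"Reply with one word.\n\nText: {text}"
    )
    for _ in range(max_attempts):
        # Normalize case, whitespace, and a trailing period before checking.
        reply = llm(prompt).strip().lower().rstrip(".")
        if reply in ALLOWED_LABELS:
            return reply
    raise ValueError(f"no valid label after {max_attempts} attempts")


def make_fake_llm(replies: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: returns canned replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


if __name__ == "__main__":
    llm = make_fake_llm(["I think it is quite upbeat!", "Positive."])
    print(guarded_classify(llm, "What a great day"))
```

In this sketch the first canned reply is rejected because it is not one of the allowed labels, and the second is accepted after normalization. A real guardrail would likely do more, but the shape is the same: interpret, verify, retry or fail.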

In contrast, domain-specific languages have well-defined semantics, resulting in fewer inconsistent problem-solving outcomes. Experienced developers can often write the code once and achieve the desired result without needing to test the waters. By investing time in learning the new terminology and constructs, one can precisely and consistently define problems and specify solutions for complex issues, which can be more cost-effective in the long run.

Verdict: not really

Why do so many people know how to use a match, but so few people know how to make a match? The economic structure of societies often demands a division of labor, resulting in many people being users or consumers rather than creators or producers. This is why plain language, while promising a great developer experience due to its low learning curve, has limited power in defining and solving complex problems. While good as an interface for everyday services, it is not sufficient to solve all possible problems. Domain-specific languages are still needed.

Is an LLM a good runtime?

If we think of the prompt as natural language instructions, then the LLM is the corresponding interpreter or runtime. Aside from how expressive plain language is, the effectiveness of prompt engineering as a development paradigm also depends on how powerful the LLM is as a runtime.

Given the fact that LLMs are generally trained on a significant portion of all the public text that we can find on the web, it is no surprise that they can be a powerful runtime that solves many problems, but by no means are they dependable.

LLMs do not cover facts well

There is a large body of research on the boundaries of LLM capabilities. For example, a study from Meta introduced the first benchmark designed to assess the ability of LLMs to internalize head, torso, and tail facts. Using a new evaluation methodology with appropriate metrics for automatically evaluating LLMs’ factuality, the study shows that even the most advanced LLMs have notable limitations in representing factual knowledge, particularly for torso and tail entities. As such, the authors suggest it is better to seamlessly blend knowledge in symbolic form and neural form.

A related study from Google DeepMind empirically explored the role of pretraining data composition in the ability of pretrained transformers to learn function classes in context, both inside and outside the support of their pretraining data distribution. The authors showed that while transformers can generalize effectively on rarer sections of the function-class space, they can easily break down as tasks become out-of-distribution. Unfortunately, much valuable knowledge is simply not in the public domain: for example, the recipe for Coca-Cola, or the data mix used to train ChatGPT. For such knowledge, there is no hope that an LLM will get it right.

Declarative is not always enough

It is common that, even for problems an LLM does cover, a simple declarative prompt does not work. For example, chain-of-thought (CoT) prompting, introduced in Wei et al. (2022), enables complex reasoning capabilities through intermediate reasoning steps. You can combine it with few-shot prompting (also known as in-context learning) to get better results on more complex tasks that require reasoning before responding. While CoT is not exactly imperative, since it is not a step-by-step algorithm for solving the problem, it is no longer declarative either, since it is a step-by-step demonstration of how to solve a problem. This certainly demands more of the developer and is a clear negative for developer experience.
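As a rough illustration of what combining few-shot and chain-of-thought prompting looks like in practice, the sketch below assembles such a prompt as plain text. The `Demo` class, the `build_cot_prompt` function, the "Q:/A:" layout, and the example questions are all assumptions made for this sketch, not a format prescribed by Wei et al. Note how each demonstration must spell out the worked steps, which is exactly the extra burden on the developer.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    question: str
    reasoning: str  # the worked, step-by-step demonstration
    answer: str


def build_cot_prompt(demos: list[Demo], question: str) -> str:
    """Assemble a few-shot chain-of-thought prompt as plain text."""
    parts = []
    for d in demos:
        parts.append(f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}.")
    # The final question is left open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        Demo(
            question="A farmer has 3 crates of 12 apples and gives away 10. How many are left?",
            reasoning="3 crates of 12 apples is 36 apples. 36 - 10 = 26.",
            answer="26",
        )
    ]
    print(build_cot_prompt(demos, "A shelf holds 4 boxes of 6 pens. 5 pens are removed. How many remain?"))
```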

Develop against an interface, not an implementation

When you develop against an interface, you are essentially programming to an abstraction or a contract rather than a specific implementation. This promotes a more modular, flexible, and maintainable codebase, as your code remains untouched when the implementation of the interface against which your code is written changes.

This is another reason why LLMs are not dependable. Currently, LLMs operate as black boxes, and their performance relies on a myriad of decisions made during training: data mixture, shuffling, learning algorithm, learning rate, and architecture. Given the substantial resources required to train an LLM, each pre-trained LLM is unique. There is no guarantee from the pretraining process that the LLM’s capabilities will not regress. A prompt that worked on one particular model may not work on a different version of that model, and certainly not on another model. Unfortunately, applications built this way do not have a good migration path yet.
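The following Python sketch illustrates the gap, under stated assumptions. `Completer`, `StubModelV1`, and `StubModelV2` are hypothetical names: the two stubs stand in for two versions of a model that share the same "text in, text out" signature but behave differently. The application code is written against the interface, yet it still breaks, because the interface promises nothing about the content of the reply.

```python
from typing import Protocol


class Completer(Protocol):
    """The only contract an LLM really offers: text in, text out."""

    def complete(self, prompt: str) -> str: ...


class StubModelV1:
    """Hypothetical model version that answers tersely."""

    def complete(self, prompt: str) -> str:
        return "4"


class StubModelV2:
    """Hypothetical newer version: same signature, different behavior."""

    def complete(self, prompt: str) -> str:
        return "Sure! The sum of 2 and 2 is 4."


def add_via_prompt(model: Completer, a: int, b: int) -> int:
    """Application code written against the interface, not a specific model."""
    reply = model.complete(f"What is {a} + {b}? Reply with the number only.")
    return int(reply)  # relies on behavior the interface does not promise


def prompt_still_works(model: Completer) -> bool:
    """A small regression check to rerun whenever the model changes."""
    try:
        return add_via_prompt(model, 2, 2) == 4
    except ValueError:
        return False


if __name__ == "__main__":
    print(prompt_still_works(StubModelV1()))
    print(prompt_still_works(StubModelV2()))
```

A check like `prompt_still_works` does not provide a migration path, but rerunning a suite of such checks against each new model version is one way to at least detect regressions early.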

Parting words

Prompt engineering offers an excellent developer experience. However, employing plain language as instructions and an LLM as the runtime has some notable drawbacks. Therefore, it is crucial to decide carefully when to opt for prompt engineering and when to resort to software engineering. In general, if the problem is shallow (a language-level problem) or soft (either open-ended, or one where an incorrect answer is acceptable), prompt engineering is a suitable solution. Otherwise, it is advisable to adopt a dual-process approach that combines prompt engineering and software engineering.
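One possible reading of the dual-process approach is sketched below in Python: the LLM handles the language-level step of turning free text into a structured request, and ordinary code handles the part that must be correct. Everything here is an assumption for illustration, including the `fake_llm` stub standing in for a real model call, the JSON shape, and the price table.

```python
import json
from typing import Callable

# Hypothetical price table in cents; the deterministic side owns the facts.
PRICES_CENTS = {"latte": 450, "espresso": 300}


def total_cents(order_json: str) -> int:
    """Software-engineering side: validate the structured request and compute."""
    order = json.loads(order_json)
    item, quantity = order["item"], order["quantity"]
    if item not in PRICES_CENTS:
        raise ValueError(f"unknown item: {item!r}")
    if not isinstance(quantity, int) or quantity < 1:
        raise ValueError(f"invalid quantity: {quantity!r}")
    return PRICES_CENTS[item] * quantity


def handle_request(llm: Callable[[str], str], utterance: str) -> int:
    """Prompt-engineering side turns free text into JSON; code does the rest."""
    prompt = (
        'Extract the order as JSON with keys "item" and "quantity".\n'
        f"Customer: {utterance}"
    )
    return total_cents(llm(prompt))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always returns the same JSON."""
    return '{"item": "latte", "quantity": 2}'


if __name__ == "__main__":
    print(handle_request(fake_llm, "Two lattes, please"))
```

The model never computes the total or decides what a latte costs; if it returns an item or quantity outside the contract, the deterministic side rejects the request rather than guessing.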