Real Problems I Encountered When Developing With AI Agents

A first-person engineering essay about productivity gains, unexpected modifications, code quality problems, and design misunderstandings encountered while developing real systems with AI agents.

Previous article: AI in Real Development Work

Introduction

Since the Codex extension in VS Code appeared and became usable enough for everyday work, I have been using AI agents regularly in development. Looking at the whole process rather than isolated demo moments, I estimate my overall productivity at roughly two to three times what it was before AI agents.

That is already a very large change. At the same time, I do not think it should be described carelessly. Some individual tasks really can become dozens of times faster. If I need a rough screen prototype, a repetitive TypeScript form adjustment, or a batch of small mechanical edits, the time difference can become almost absurd. But total development productivity does not increase by that same ratio.

The reason is simple. AI agents rarely produce the exact result I want on the first try. Sometimes they break existing code. Sometimes they touch unrelated areas. Sometimes they produce something that looks correct in the UI while silently damaging the actual behavior underneath.

This article is about those kinds of situations. Most of the work I am describing was for internal company systems, so I need to omit or soften certain concrete details where necessary. Even so, the patterns themselves are real.

Unexpected Modifications by AI Agents

The first kind of problem I ran into was unexpectedly broad modification. An AI agent would sometimes change parts of the system that had nothing to do with what I actually instructed.

I noticed this very clearly while developing a company system that manages the flow from order entry to shipping. This was in the latter half of 2025. It was also one of the first truly full-scale systems I built with Cotomy.

Without AI agents, I think that system probably would have taken around a year to finish. With AI agents, most of it was completed in about four months. That difference is too large to dismiss. The productivity benefit was real, visible, and practically important.

At the same time, there were many cases where bugs came from changes I never asked for at all. That became one of the most unnerving parts of AI-driven development.

Order Screen Incident

One of the clearest examples was the order screen. On that screen, the user can select a business partner, quotation data that determines the pricing basis, a destination, and a requester. Depending on how the user arrives there, the business partner or quotation data may already be specified before the screen opens.

If either of those values has already been specified, it must not be editable on the order screen. For safety, the business partner and quotation are treated as non-changeable once they have been fixed in that flow. If one of them is registered incorrectly, the rule is to enter the reason, invalidate that record, and create a new one instead. There is server-side validation for this, but the UI must also prevent editing in the first place.
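The server-side half of that rule can be sketched as a simple guard. This is an illustrative sketch only: the type and function names (OrderDraft, assertLockedFieldsUnchanged, the field names) are hypothetical, not the actual system's schema.

```typescript
// Hypothetical shape of the order data; field names are illustrative.
interface OrderDraft {
  partnerId: string | null;   // business partner: locked once fixed upstream
  quotationId: string | null; // quotation data: locked once fixed upstream
  destinationId: string;
  requesterId: string;        // must remain editable
}

// Server-side guard: if a locked value was already fixed before the screen
// opened, any attempt to change it is rejected outright. Correcting a bad
// record means invalidating it and creating a new one, not editing it.
function assertLockedFieldsUnchanged(existing: OrderDraft, incoming: OrderDraft): void {
  const lockedKeys: ("partnerId" | "quotationId")[] = ["partnerId", "quotationId"];
  for (const key of lockedKeys) {
    const fixed = existing[key];
    if (fixed !== null && incoming[key] !== fixed) {
      throw new Error(
        `Field "${key}" is fixed and cannot be changed; ` +
        `invalidate the record and create a new one instead.`
      );
    }
  }
}
```

The UI-side disabling described above is a convenience on top of this; the guard is what actually enforces the rule.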

By that point, the basic behavior of the screen was already working, and I was in the phase of tightening the detailed behavior. The instruction I gave to Codex was not vague. I specified the relevant fields explicitly, and in practice I also attached fairly detailed conditions to when editing should be disabled. At first glance, the result looked correct. The screen showed the expected values, and the disabled state appeared to be working.

Then testing exposed a serious problem. Once the field was disabled, its value was no longer submitted.

The technical reason was annoying in a very specific way. The selection component in question was implemented as a class derived from CotomyElement. It reads several data attributes during initialization and dynamically generates the internal input elements and other required DOM nodes from there. The display-side behavior for a disabled state already existed. Codex ignored all of that architecture and simply disabled the visible element that happened to be on screen.

As a result, the input element required for FormData submission effectively did not exist. The visible label text was filled, so the UI looked fine, but the value itself was never sent to the server.
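The underlying HTML rule is that disabled form controls are never included in form submission, while hidden inputs are. The usual fix is to disable only the visible element and keep a separate input that still carries the value. The sketch below models that rule with a tiny stand-in serializer rather than real DOM code, since the point is the submission semantics, not any particular component.

```typescript
// Minimal model of the HTML submission rule: disabled controls are skipped.
// This is an illustrative stand-in for FormData, not real DOM or Cotomy code.
interface Control {
  name: string;
  value: string;
  disabled?: boolean; // disabled controls are never submitted
  hidden?: boolean;   // hidden inputs ARE submitted; only disabled excludes
}

function serialize(controls: Control[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (const c of controls) {
    if (c.disabled) continue; // the rule that silently broke the order screen
    out[c.name] = c.value;
  }
  return out;
}

// Naive approach: disabling the single visible element drops the value.
const broken = serialize([{ name: "partnerId", value: "P-001", disabled: true }]);
// broken.partnerId === undefined

// Working pattern: disable only the visible display element and keep a
// hidden input that still carries the value to the server.
const fixed = serialize([
  { name: "partnerDisplay", value: "ACME Corp", disabled: true },
  { name: "partnerId", value: "P-001", hidden: true },
]);
// fixed.partnerId === "P-001"
```

A component that generates its own internal inputs, like the one described above, has to apply this distinction inside its disabled-state logic; disabling whatever element happens to be visible is exactly the shortcut that fails.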

That kind of bug is especially frustrating. The screen looks correct. The label is there. The user thinks the value exists. And underneath all of that, the submission is already broken. What made me especially angry was that it populated only the visible text, as if it were trying to hide the problem instead of solving it. It is the kind of result that makes you stare at the screen and feel your mood drop immediately.

Second Unintended Modification

Further testing revealed another issue on the same screen. The order screen also contains destination address and requester fields. The destination address master has a default requester associated with it, but the requester itself still needs to remain editable.

I had no intention of making the requester non-editable, and of course I gave no such instruction. Nevertheless, Codex applied the same modification pattern to every component of the same shape. As a result, the requester field also became completely uneditable.

This bug was found during testing, so no production damage occurred. Still, it is deeply unsettling to find bugs in parts of the screen you never asked to change. I think many engineers will immediately understand that feeling.

What made it worse was that my instruction had been extremely explicit. I included the exact field names, and in reality the conditions I gave were even more detailed than that. Even so, the change expanded to other fields without that expansion being reported clearly in the chat output. There are certainly human engineers who make that kind of unilateral generalization too. But I had not expected an AI, which presumably wants to conserve resources wherever possible, to behave in quite that way. That was one of the moments when I started to understand, at a more visceral level, that AI agents can generalize the wrong thing with great confidence.

Implementation Experience With an Item Master

Another case came from the item master. In that system, I chose to treat products and materials as the same entity inside the item master. I do not think that is the kind of design people would normally adopt. But this system had many intermediate products, and shipping those intermediates directly was not unusual. Because of that, I wanted to handle them more uniformly, and for this system I still consider that decision rational.

At the same time, I think there are probably many more cases where this shape would not be an appropriate general master design. That is important context, because I do not want this article to sound as if I am presenting that master shape as a general recommendation.

The item master contained several categories such as product items, novelty products, labels, boxes, and other materials. Each category had different fields. For example, GTIN existed only for product items and novelty products, and novelty products were registered with an internal in-house ID.
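One natural way to express category-specific fields like that in TypeScript is a discriminated union. The sketch below is a hypothetical modeling, not the real schema; the field names are illustrative.

```typescript
// Hypothetical modeling of the item master as a discriminated union.
// Category names follow the article; field names are illustrative.
type Item =
  | { kind: "product"; gtin: string; name: string }
  | { kind: "novelty"; gtin: string; inHouseId: string; name: string }
  | { kind: "label"; name: string }
  | { kind: "box"; name: string }
  | { kind: "material"; name: string };

// GTIN exists only for product items and novelty products, so the type
// system itself rules out reading a GTIN from labels, boxes, or materials.
function gtinOf(item: Item): string | null {
  switch (item.kind) {
    case "product":
    case "novelty":
      return item.gtin;
    default:
      return null;
  }
}
```

With this shape, screen control logic can branch on `kind` and the compiler checks that each branch only touches fields that actually exist for that category.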

The actual screen control was more complicated than that summary suggests, but Codex could produce working behavior after only a few rounds of instruction. In that sense, it was undeniably useful. But the resulting code needed heavy refactoring.

That is the point I do not want to hide. I am not programming as a hobby. The system has to remain maintainable whether AI is available or not. If the structure is not resilient to change, then even AI-assisted modification will eventually hit a limit. Because I already understood that much at the time, I always performed code review. And in many cases, I ended up refactoring a large portion of the generated code myself.

Code Quality Problems

If I had to say what the problem was in one phrase, it would simply be this: the code was dirty.

One of the worst patterns was a large number of private methods that were called only once and did nothing except move a piece of logic sideways. Those methods did not have an independent meaning as screen behavior. They simply scattered the code, and once that kind of thing begins to accumulate, following the logic becomes extremely difficult.

Of course, giant methods that contain everything are also bad. That is simply true. Honestly, refactoring from that state would have been easier. In modern environments, especially with strong refactoring tools, restructuring that kind of code is often more straightforward than cleaning up fake modularity that never had any real meaning.
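A contrived before-and-after illustration of the antipattern, assuming nothing about the real generated code: a private method called exactly once whose name restates its body instead of naming a concept, versus the same logic simply inlined at its single call site.

```typescript
// Contrived illustration only; not real generated or Cotomy code.
class OrderPageBefore {
  load(raw: string): number {
    // The logic has merely been moved sideways, one hop away.
    return this.parseAndValidateAndNormalizeQuantity(raw);
  }
  // Called exactly once; no independent meaning as screen behavior.
  private parseAndValidateAndNormalizeQuantity(raw: string): number {
    const n = Number(raw);
    return Number.isFinite(n) && n > 0 ? Math.floor(n) : 0;
  }
}

// Inlined version: the same behavior reads top to bottom in one place.
class OrderPageAfter {
  load(raw: string): number {
    const n = Number(raw);
    return Number.isFinite(n) && n > 0 ? Math.floor(n) : 0;
  }
}
```

Multiply the first shape by a few dozen methods and the scattering effect described above becomes the dominant cost of reading the file.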

Variable naming also caused friction. Many generated variable names were far too long. They were not technically wrong, but they forced too much effort into simply reading the code.

Ignoring Coding Rules

AI agents also ignored coding style rules surprisingly often.

I have my own rules for writing this kind of code. For example, in PageController code I usually place a private field and the property that exposes it close together, instead of gathering all member variables at the top of the file. I do that because I dislike increasing the distance between things that conceptually belong together. The same idea appears in CotomyElement-related code as well, where I usually implement properties with lazy initialization using ??= on first access.
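The style rule above can be sketched as follows. The class and member names are invented for illustration; only the pattern itself (field and property adjacent, lazy initialization with ??= on first access) reflects the rule being described.

```typescript
// Stand-in element class; not the real CotomyElement.
class FakeElement {
  constructor(public readonly selector: string) {}
}

// Sketch of the preferred layout: each private field sits directly beside
// the property that exposes it, instead of all fields being gathered at
// the top of the class far from their usage.
class OrderPageController {
  #partnerSelect?: FakeElement;
  get partnerSelect(): FakeElement {
    // Lazy initialization: created on first access, then reused.
    return this.#partnerSelect ??= new FakeElement("#partner");
  }

  #quotationSelect?: FakeElement;
  get quotationSelect(): FakeElement {
    return this.#quotationSelect ??= new FakeElement("#quotation");
  }
}
```

The point of the adjacency is locality: reading a property never requires scrolling to a distant declaration block to find its backing field.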

Those rules were not hidden. They were present in my instructions, in the surrounding code, and often in the concrete examples I gave. Even so, Codex frequently ignored them and generated a completely different shape. Private member variables would be gathered far away from the properties that used them, and sometimes it would even create a separate area just to assign a batch of members together. The result was not necessarily impossible to fix, but it meant the first output often failed at the level of code organization even when the rough behavior looked usable.

This happened often enough that I stopped expecting the correct implementation on the first attempt. The agent could become useful very quickly, but precise compliance with the intended coding style was much less reliable.

Design Misunderstandings

Another recurring problem was design misunderstanding at the type level.

There were times when Codex failed to generate valid TypeScript around CotomyElement.byId. The generic parameter of byId must extend CotomyElement. Even so, Codex sometimes used HTMLElement as the generic parameter.

This was not a subtle issue. It was easily verifiable from the type definitions in node_modules. That made the mistake especially frustrating, because it was not the kind of thing that required deep inference or hidden project knowledge.
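The constraint in question can be sketched with stub classes. CotomyElementStub below stands in for CotomyElement, and the real byId signature may differ in detail; the sketch only shows why a generic parameter constrained to the library's base class cannot be satisfied by HTMLElement.

```typescript
// Stubbed sketch; not the real Cotomy API.
class CotomyElementStub {
  // A private field makes the class nominally typed, so structurally
  // similar types (like HTMLElement) cannot satisfy the constraint.
  #brand = true;
  constructor(public readonly id: string) {}

  // The generic parameter must extend the Cotomy base class.
  static byId<T extends CotomyElementStub>(
    this: new (id: string) => T,
    id: string,
  ): T {
    return new this(id);
  }
}

class SelectBox extends CotomyElementStub {}

// Correct: T is inferred as SelectBox, which extends the base class.
const box = SelectBox.byId("partner");

// Rejected by the compiler, because HTMLElement does not extend the
// Cotomy base class:
//   CotomyElementStub.byId<HTMLElement>("partner");
```

Because the constraint is written directly into the type definitions, this is exactly the kind of error that is trivially checkable from node_modules, which is what made the repeated mistake so frustrating.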

Later, I strengthened AGENTS.md and that reduced the frequency of this specific error. But in the early stages of development, many frustrations had exactly this flavor. Unused variables would appear for no real reason. Form classes would be defined in unrelated locations. Pieces of code that should have lived together would be separated without any architectural logic behind the separation.

I also felt that tools like Codex and ChatGPT tended to emphasize responsibility boundaries more when reviewing code than when writing it. When generating code, they often seemed much less strict. Pointless factoring-out of shared helpers and odd naming were almost routine.

AI Development Myths

Online, people often say that anyone can build systems with AI now. Some even say that engineers themselves will disappear.

I think those claims are mostly built on a very small set of examples. When people actually show what they built, it is often something extremely simple, like a calculator, a clock, or some other tiny standalone program.

That is not the same thing as building a large system with durable structure. And it is very far from building one safely.

In my experience, building a large system with robust design using AI alone is still nowhere near reality today. Perhaps it becomes possible in some distant future, but from where I stand now, that goal is so far away it is not even clearly visible. Security-related areas are especially dangerous when handled without deep understanding. If I ran into problems like the ones described above, I do not want to imagine what happens when someone delegates sensitive security behavior to AI without understanding it deeply.

Conclusion

Even after all of these complaints, I still think AI provides extraordinary productivity. It is also true that AI helps me reach levels of quality that I probably could not have reached alone.

The issues I described here are relatively minor in one sense. They are not evidence that AI agents are useless. I think they are problems that can be addressed through stronger design discipline, better instructions, better review habits, clearer architectural boundaries, and improvement in my own engineering skill.

But that does not make them trivial. If my own engineering skill is weak, or if the system design is vague, AI does not save me from that weakness. It amplifies it. At the same time, if I am seriously trying to improve my engineering skill, AI can give remarkable support because of the sheer amount of knowledge it can bring into the conversation.

That is why I still think improving my own engineering skill and system design remains essential. AI is a powerful tool. But how it is used, and whether that power becomes real leverage or just accelerated disorder, still depends on the engineer.

Next article: How AI Will Change Software Development Work

Learn Cotomy

Cotomy is a DOM-first UI runtime for long-lived business applications.