|
They should never have added that "feature" ...
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
I have to extract payment data from PDFs. Some have just one or two lines I need, and others have rows and columns of data.
What's the best way to extract this data using C#?
“In theory, theory and practice are the same. But in practice, they never are.”
If it's not broken, fix it until it is.
Everything makes sense in someone's mind.
|
|
|
|
|
Kevin, you've been here long enough, and asked enough questions already, to know that this query is far too vague to get a practical answer. All we can do is point you at something generic like iTextSharp 5.5.13.3 in the NuGet Gallery and suggest you start there!
|
|
|
|
|
I was hoping for "I've used..." or "Try this API" kind of answers. I don't have enough info to give more detail because I don't know where to start.
|
|
|
|
|
Most people don't use PDFs to transfer data between applications of any kind, so the pool of people who could answer that question is exceedingly small; probably none of the regulars around here have done it.
|
|
|
|
|
If it's a pdf of an image, there's no text either.
There is no "general" solution.
"Before entering on an understanding, I have meditated for a long time, and have foreseen what might happen. It is not genius which reveals to me suddenly, secretly, what I have to say or to do in a circumstance unexpected by other people; it is reflection, it is meditation." - Napoleon I
|
|
|
|
|
I know what you mean, but ... as Dave says PDF is not a data transfer format, it's a user presentation format.
Using it to transfer computer-readable info is like using Word to send a bitmap in an email - you could probably do it, but anyone who saw what you were doing would be wondering "what fool came up with *that* idea?"
|
|
|
|
|
Yeah, I hear ya. My client has invoices in PDF format that need data extracted. I found https://docs.apryse.com/ which looks promising.
|
|
|
|
|
I guess it depends on where the PDFs originate: if it's a single company and they will guarantee, in writing signed in blood, to never, ever, ever change the format, I might give it consideration - but invoices? I can see so many ways in which that could go seriously wrong and somebody end up in jail for tax evasion ... I'd probably decline to quote on that job myself.
|
|
|
|
|
Yes. Should be using EDI ... or "something".
EDI 810 Invoice: Transactions, Format & Specifications | Astera
|
|
|
|
|
In some businesses (especially insurance), it was common to use PDFs for data capture purposes. The PDFs would then be sent to companies that processed them and converted the data inside into something that could be used in the office.
|
|
|
|
|
Yes, if I remember correctly, PDF has a forms mode that limits what users can enter and where?
But you wouldn't use that for invoices!
|
|
|
|
|
I've seen this used for something akin to invoices in the past. That's how the commercial insurance industry operates; they do love their PDFs.
|
|
|
|
|
I've used Ghostscript to parse PDF files before. You don't even need to write any code, just use the precompiled tools.
ghostscript extract PDF text
They have a C# wrapper but I've never used it.
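For anyone who wants to drive the precompiled tool from C# rather than use a wrapper, here is a rough sketch that shells out to Ghostscript's `txtwrite` device. The class and method names are made up for illustration, the executable name is an assumption (it is typically `gs` on Linux/macOS and `gswin64c` on Windows), and `ArgumentList` assumes .NET Core 2.1 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not part of Ghostscript or any wrapper library.
public static class GhostscriptText
{
    // Builds the command-line arguments for Ghostscript's txtwrite device,
    // which writes the text content of the PDF to a plain-text file.
    public static IReadOnlyList<string> BuildArguments(string pdfPath, string textPath) =>
        new[]
        {
            "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=txtwrite",
            "-sOutputFile=" + textPath,
            pdfPath
        };

    // Runs Ghostscript and returns its exit code (0 normally means success).
    public static int Extract(string gsExecutable, string pdfPath, string textPath)
    {
        var startInfo = new ProcessStartInfo(gsExecutable) { UseShellExecute = false };
        foreach (var argument in BuildArguments(pdfPath, textPath))
            startInfo.ArgumentList.Add(argument);

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start Ghostscript.");
        process.WaitForExit();
        return process.ExitCode;
    }
}
```

A call would look something like `GhostscriptText.Extract("gs", "invoice.pdf", "invoice.txt")`. The output is plain text only, so picking the payment fields out of it still has to be coded per invoice layout.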
|
|
|
|
|
Tika has a PDF parser. Among many others.
Apache Tika
You would of course still need to code for each individual format.
(Note that there are even image parsers.)
|
|
|
|
|
The title of this message might look like a joke, but it is actually very serious.
The other day I wanted to implement my own BigInteger class, and I was using bytes as the backing fields. Yet, I would only use values from 0 to 9 for each digit.
Bing AI suggested that I create an enum with values ranging from D0 to D9 (I think their actual values are obvious).
Yet, using an enum like that doesn't forbid users from doing things like (DecimalDigit)56 and passing 56 to an enum that was only supposed to support values ranging from 0 to 9.
Of course I can validate the values at run-time... but the entire purpose of using an enum was to avoid mistakes like that.
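To make the problem concrete, here is a minimal sketch. The `DecimalDigitChecks` helper is a hypothetical name used only for illustration; the point is that the cast compiles without complaint, and only a run-time check such as `Enum.IsDefined` catches it.

```csharp
using System;

public enum DecimalDigit : byte { D0, D1, D2, D3, D4, D5, D6, D7, D8, D9 }

// Hypothetical helper class, for illustration only.
public static class DecimalDigitChecks
{
    // The compiler accepts this cast even though 56 names no enum member.
    public static DecimalDigit MakeBogus() => (DecimalDigit)56;

    // Run-time validation: true only for values that match a declared member.
    public static bool IsValid(DecimalDigit digit) =>
        Enum.IsDefined(typeof(DecimalDigit), digit);
}
```

With this, `DecimalDigitChecks.IsValid(DecimalDigitChecks.MakeBogus())` is false while `DecimalDigitChecks.IsValid(DecimalDigit.D9)` is true, but nothing stops the bogus value from being created in the first place.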
So, my solution was to create a class (in fact, a struct, but a class serves the same purpose) that has a private constructor and public static readonly fields ranging from D0 to D9. This way, users outside of the class cannot pass values that aren't in the 0-9 range, unless they really want to mess things up (like using unsafe code or reflection).
This also reminded me of a job where we had one enum with like 20 values... and then, many, many, many switches to get the many different traits of those enums.
Wouldn't it be better to just have classes, with all the traits, and use the classes?
Aside from the use of the enum in switch statements, they work the same in most cases, work even easier in cases where we usually had to use helper methods... and if a new trait is added, we have a single place (where the enum values are declared) to fix... with no chance of "forgetting" a case in a switch somewhere else.
What do you guys think?
Example:
// Class-like version: private constructor, fixed set of instances.
public struct DecimalDigit
{
    public static readonly DecimalDigit D0 = new(0);
    public static readonly DecimalDigit D1 = new(1);
    public static readonly DecimalDigit D2 = new(2);
    public static readonly DecimalDigit D3 = new(3);
    public static readonly DecimalDigit D4 = new(4);
    public static readonly DecimalDigit D5 = new(5);
    public static readonly DecimalDigit D6 = new(6);
    public static readonly DecimalDigit D7 = new(7);
    public static readonly DecimalDigit D8 = new(8);
    public static readonly DecimalDigit D9 = new(9);

    private DecimalDigit(byte value)
    {
        _value = value;
    }

    private readonly byte _value;

    public byte ByteValue
    {
        get => _value;
    }
}

// Alternative: the plain enum version.
public enum DecimalDigit : byte
{
    D0,
    D1,
    D2,
    D3,
    D4,
    D5,
    D6,
    D7,
    D8,
    D9
}
Notice that although the enum version is smaller, if we need to add names for the values, in the class we just add a property, while for the real enum we have to create a helper method.
If we need to convert them to numbers, add an emoji or whatever, in the first version it is just a matter of adapting the class, while in the second it is a matter of creating more (and somewhat unrelated) methods.
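A cut-down sketch of that difference, using a made-up `Name` trait and only three digits to keep it short. The names `DecimalDigitEnum` and `DecimalDigitEnumExtensions` are hypothetical and exist only so both versions fit in one file.

```csharp
using System;

// Class-like version: the trait lives next to the value it describes.
public readonly struct DecimalDigit
{
    public static readonly DecimalDigit D0 = new(0, "zero");
    public static readonly DecimalDigit D1 = new(1, "one");
    public static readonly DecimalDigit D2 = new(2, "two");
    // D3 through D9 would follow the same pattern.

    private DecimalDigit(byte value, string name)
    {
        _value = value;
        _name = name;
    }

    private readonly byte _value;
    private readonly string _name;

    public byte ByteValue => _value;
    public string Name => _name;
}

// Enum version: the same trait needs a separate helper with a switch.
public enum DecimalDigitEnum : byte { D0, D1, D2 }

public static class DecimalDigitEnumExtensions
{
    public static string GetName(this DecimalDigitEnum digit) => digit switch
    {
        DecimalDigitEnum.D0 => "zero",
        DecimalDigitEnum.D1 => "one",
        DecimalDigitEnum.D2 => "two",
        // Needed because any byte can be cast to the enum.
        _ => throw new ArgumentOutOfRangeException(nameof(digit))
    };
}
```

Adding a D3 to the struct forces the name to be supplied on the same line, whereas adding it to the enum compiles fine and only fails when `GetName` hits the default arm.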
Edit: I had some questions about why I would create a new decimal class. There is no real need to create one; I just wanted to do it as an exercise. I can tell that the .NET BigInteger implementation is way faster than my class. Yet, just by writing UnsignedDecimalInteger I saw opportunities to write Quadbits (effectively one hexadecimal digit, or just 4 bits), so in one byte I can store 2 Quadbits. I also saw opportunities for caching the internal buffers I use... and I am just "relearning" how to do math the "old way" using decimal values. I will, at some point, improve it to use 32 or 64 bits at once.
Also, one of the next steps, be it with BigInteger or my UnsignedDecimalInteger, is to create a BigDecimal or similar class. In fact, to hold a value alone (without caring about operations), I just need an extra value telling where the dot separating the integer part from the fractional part goes. Or, I can literally have two BigInteger (or similar) values, one for the left side and one for the right side of the decimal point.
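A minimal sketch of the first option, assuming the built-in `System.Numerics.BigInteger` as the digit store. `BigDecimalSketch` is a made-up name, and only storage and formatting are covered, not arithmetic.

```csharp
using System;
using System.Numerics;

// Hypothetical type: an integer plus the position of the decimal point.
public readonly struct BigDecimalSketch
{
    public BigDecimalSketch(BigInteger unscaledValue, int scale)
    {
        if (scale < 0) throw new ArgumentOutOfRangeException(nameof(scale));
        UnscaledValue = unscaledValue;
        Scale = scale;
    }

    // All the digits, with the decimal point removed.
    public BigInteger UnscaledValue { get; }

    // How many of those digits sit to the right of the decimal point.
    public int Scale { get; }

    public override string ToString()
    {
        // Pad with leading zeros so there is always at least one integer digit.
        string digits = BigInteger.Abs(UnscaledValue).ToString().PadLeft(Scale + 1, '0');
        string sign = UnscaledValue.Sign < 0 ? "-" : "";
        if (Scale == 0) return sign + digits;

        int split = digits.Length - Scale;
        return sign + digits.Substring(0, split) + "." + digits.Substring(split);
    }
}
```

For example, an unscaled value of 12345 with a scale of 2 represents 123.45. The two-BigInteger alternative would additionally need some way to keep leading zeros of the fractional part (0.05 versus 0.5), which a single scale value handles for free.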
modified 1-Aug-23 18:09pm.
|
|
|
|
|
I use .IsDigit more often than defining what one is. Enums make code more readable. And can save storage.
|
|
|
|
|
Your DecimalDigit struct should be marked as readonly, since it will never be mutated.
You could also replace the backing field with a read-only auto property.
You'll probably want to implement IEquatable<T>, and possibly IComparable<T>. And override ToString. At which point, it might be better to use a readonly record struct.
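Roughly what that could look like, as a sketch assuming C# 10 or later (for record structs). The compiler generates the IEquatable<T> implementation and the equality operators; IComparable<T> and ToString are still written by hand.

```csharp
using System;

public readonly record struct DecimalDigit : IComparable<DecimalDigit>
{
    public static readonly DecimalDigit D0 = new(0), D1 = new(1), D2 = new(2),
        D3 = new(3), D4 = new(4), D5 = new(5), D6 = new(6),
        D7 = new(7), D8 = new(8), D9 = new(9);

    // Private, so only the ten instances above are created through the constructor.
    private DecimalDigit(byte value) => ByteValue = value;

    // Read-only auto property instead of an explicit backing field.
    public byte ByteValue { get; }

    public int CompareTo(DecimalDigit other) => ByteValue.CompareTo(other.ByteValue);

    // Replaces the record's generated "DecimalDigit { ByteValue = 3 }" text.
    public override string ToString() => ByteValue.ToString();
}
```

Note that `default(DecimalDigit)` is still available to callers and happens to equal D0, which is harmless here but worth remembering for types where zero is not a valid value.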
And after all that, without having to resort to reflection or unsafe code, you can still create an invalid instance:
ReadOnlySpan<byte> span = new byte[] { 42 };
DecimalDigit digit = System.Runtime.InteropServices.MemoryMarshal.Read<DecimalDigit>(span);
Console.WriteLine(digit.ByteValue);
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
Thanks, I forgot I can make the entire struct final.
Yet, I wouldn't get rid of the backing field... I don't like to make a class/struct use property getters instead of using their fields directly... that's really a personal choice.
In any case, I will mark the struct as readonly. Thanks for reminding me of that!
|
|
|
|
|
Question... if I have an array of my struct... the array itself is readonly... yet I can modify the contents of the array... should I mark the class readonly because I never replace one array by another, or not, as there are mutator methods to change the contents of the inner array?
|
|
|
|
|
You can't mark a class as readonly; only a struct.
And a struct containing an array of structs is probably a bad idea - the entire array would need to be stored inline as part of the outer struct's data.
|
|
|
|
|
The real class implements IEquatable and IComparable. I simply didn't want to make the code too huge for this discussion.
And, even though Marshal methods aren't marked as unsafe they are, by definition, unsafe. In fact, reflection itself is not called unsafe, although you can create empty instances of classes that do not define default constructors and the like...
|
|
|
|
|
Paulo Zemek wrote: "to implement my own BigInteger class"
Idea: add information to your post about why you want/need to do this.
What functionality does the MS BigInteger struct provide that you need extended, modified, etc.?
«The mind is not a vessel to be filled but a fire to be kindled» Plutarch
|
|
|
|
|
Skipping the class/struct/readonly/whatever discussions, and commenting on the subject line topic: Enums.
My programming childhood (i.e. as a university freshman) was with Pascal, which provided true enums. Not named integers. In the C family of languages, replacing an integer #define with an enum is just syntactic sugar. They are integers in disguise. Or not really disguise - it is just a very thin veil.
The very idea behind enums is that they are not integers, no matter what C programmers say. January, February and so on are months, not integers! May is May. It takes a C programmer to get into an argument whether the month of May "really" is 4 or 5. Well, the C programmer would never be in doubt. Also, half of May is March, that is obvious, isn't it? (Half of June is also March.)
Noooo! Enums are countable (enumerable, if you prefer). So are apples. That doesn't make an apple an integer.
If you are really talking about integers, they are integers. Replacing the digit 0 with 'D0', 1 with 'D1' and so on does not, not in any way whatsoever, 'improve legibility'. Quite to the contrary; it hides the fact that you are in fact talking about integer numbers. The reader has to look up the definition of 'D4' to see what it represents: Is it the binary integer 4, or is it the character code for the '4' digit character, or something quite different, such as 'dictionary entry with key 4'?
Enums have no place in this context of yours - not even in the C-style 'enums are named integers' style. You are handling true integers. Declare them as true integers, and nothing else.
(And: if Pascal had still been alive, you could have had exactly what you are trying to achieve by defining a new subrange type such as TYPE SingleDigitInteger = 0..9; and the compiler + runtime system would catch all attempts to set a variable/field of this type to a value outside its defined range. It would still be a numeric integer. Unfortunately, Pascal, and a lot of great ideas it represented, died several decades ago.)
|
|
|
|
|
You got it... the real purpose was to have a Range<0, 9>.
That's something I can easily do in C++ templates... and it would avoid enums altogether... and also classes that act as enums.
|
|
|
|
|