|
If I'm scraping a web page, I'm worried about network performance, not regex performance.
Short of being on a SAN (which there'd be no reason to scrape except for legacy integration), the network I/O will overshadow any potential regex performance issues by a large margin.
So I'm not worried about that.
Adding: if it really became an issue, I could switch over to a non-backtracking engine like the one I wrote in C#.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
modified 20-Sep-19 8:50am.
|
I think there's a huge variation in performance between different regex engines.
Yes, it's never going to be lightning fast because of all the backward/forward matching going on, but it can be a damned sight quicker than .NET would make it seem!
Whenever you find yourself on the side of the majority, it is time to pause and reflect. - Mark Twain
|
Ah, cool, so someone has done it before, just not quite the same way.
I think my solution is simpler. They're munging the objects using Python.
I want to make it so you can define JSON objects with just the regex. It will use nested group captures to build the JSON hierarchy.
Different enough to satisfy me that it's worth it.
Thanks for the link.
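To make the nested-group-capture idea concrete, here's a minimal sketch of my own (the input format, class, and method names are all invented for illustration) using .NET's `CaptureCollection`, which keeps every capture of a repeated group rather than just the last one:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedCaptureDemo
{
    // Each repeated (?<key>...)=(?<value>...) pair inside a person{...} block
    // becomes one field of that object, giving a two-level JSON-like hierarchy.
    public static List<Dictionary<string, string>> Parse(string input)
    {
        var re = new Regex(@"person\{(?:(?<key>\w+)=(?<value>\w+);?)+\}");
        var result = new List<Dictionary<string, string>>();
        foreach (Match m in re.Matches(input))
        {
            var obj = new Dictionary<string, string>();
            var keys = m.Groups["key"].Captures;
            var values = m.Groups["value"].Captures;
            for (int i = 0; i < keys.Count; i++)
                obj[keys[i].Value] = values[i].Value; // captures stay index-aligned
            result.Add(obj);
        }
        return result;
    }

    public static void Main()
    {
        var people = Parse("person{name=Alice;age=30}person{name=Bob;age=25}");
        Console.WriteLine(people[0]["name"]); // Alice
        Console.WriteLine(people[1]["age"]);  // 25
    }
}
```

Going deeper than two levels is where it gets hairy, since each extra nesting level multiplies the capture bookkeeping.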
|
Can it go arbitrarily deep?
And remember that .NET's regular expression engine is far richer than most, so you may not be describing a generally applicable technique.
I've been working at loading data from JSON files into SQL Server for about a year now. I convert JSON to XML on the fly and pass the XML elements to SQL Server for further processing and storage, using SQL Server's built-in XML functions.
In my situation, it's all about getting data from file to table as quickly as possible (with an eye toward not hogging resources), and I never need to have all the objects in memory at once.
As in your case, I/O seems to be the main bottleneck, with writing to the database being slower than reading from the disk.
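For what it's worth, the flat-object case of that JSON-to-XML hop can be sketched in a few lines with System.Text.Json and LINQ to XML. This is my own illustration, not the poster's actual converter, and it ignores nesting, arrays, and escaping concerns:

```csharp
using System;
using System.Text.Json;
using System.Xml.Linq;

public static class JsonToXml
{
    // Convert one flat JSON object into a single XML element,
    // with one child element per JSON property.
    public static XElement ToXml(string json, string rootName)
    {
        var root = new XElement(rootName);
        using (var doc = JsonDocument.Parse(json))
        {
            foreach (var prop in doc.RootElement.EnumerateObject())
                root.Add(new XElement(prop.Name, prop.Value.ToString()));
        }
        return root;
    }

    public static void Main()
    {
        var row = ToXml("{\"id\":1,\"name\":\"widget\"}", "row");
        Console.WriteLine(row.ToString(SaveOptions.DisableFormatting));
        // <row><id>1</id><name>widget</name></row>
    }
}
```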
|
I'm targeting .NET, so I'm not worried about it. If I were to port it to anything, it would be something that at least used PCRE, which is about as rich as .NET's regex, IIRC.
But yeah, I'm looking at going arbitrarily deep. If I can't do it using nested group captures, I'll do it by allowing you to define a pseudo-JSON document where each of the values is a regex instead of an actual value.
|
What problem does this solve? Judging by posts on freelance work websites, I am sure people are scraping websites just fine already.
"It is easy to decipher extraterrestrial signals after deciphering Javascript and VB6 themselves.", ISanti
|
It would present a JSON-based facade you can apply over a traditional, non-web-service website, so you can basically front the website with a JSON REST service using some regex.
So say I declare some regex captures over Wikipedia so I can scrape encyclopedia information with search queries.
That site is then exposed as a REST service that I can use as though it was designed for that.
That's the basic idea anyway. I'm simplifying here as much as I can; in truth it might not scale to complexity. I'm still toying with the idea.
One of the things you can do with it is expose it to the client browsers so they can query that way, or you can handle it using JSON processing on the server, or whatever; maybe insert it into a MongoDB or something. There are lots of potential use cases out there, I'd think.
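As a rough sketch of the facade idea (entirely my own illustration; the convention that named captures become JSON field names is an assumption, and in the real thing the HTML would come from an HTTP GET of the page):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScrapeFacade
{
    // Hypothetical convention: the named captures in the "endpoint" regex
    // become the field names of the JSON object the facade returns.
    public static string ExtractAsJson(string html)
    {
        var m = Regex.Match(html, "<h1[^>]*>(?<title>[^<]+)</h1>");
        return "{ \"title\": \"" + m.Groups["title"].Value + "\" }";
    }

    public static void Main()
    {
        var html = "<h1 id=\"firstHeading\">Regular expression</h1><p>...</p>";
        Console.WriteLine(ExtractAsJson(html));
        // { "title": "Regular expression" }
    }
}
```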
|
The question is... what workload are you trying to circumvent here?
It sounds like all you are doing is providing the result in JSON format, as opposed to, say, a class object if you were using a .NET library to do the job. That isn't particularly useful, as it's already a very simple task to implement.
|
It would be part of the functionality of that JSON project I linked to, so you are actually getting back class objects.
They're basically dictionary classes that support dynamic call sourcing ("expando" objects) and expose the fields as properties.
Currently, to do a REST call, you write something like:
var args = new JsonObject();
args.Add("api_key", "c83a68923b7fe1d18733e8776bba59bb");
dynamic json = JsonRpc.Invoke("https://api.themoviedb.org/3/tv/2919", args);
Console.WriteLine(json.name);
Then you can do JSONPath queries, standard dictionary gets, or dynamic calls (like above) on the object you get back. It's a dictionary with extra methods.
Or you can use any dictionary and get to the methods through static methods on JsonObject.
There are lots of manipulation and querying features available, plus features to help you create arrays of entities out of JSON arrays, etc.
This is the core library that I may add this support to.
So I'd add another way to do invocations, using scraping, and return the objects in the same manner.
|
So, scrape a website and expose it as JSON to others. I am not aware of anyone who would need it. If I need to scrape something, I need the latest content from the page, not something historic from a database. I can scrape anything I want because I know exactly what I want. I would not spend time on another service and try to make it understand what I want.
Now, creating a JSON-formatted string from, let's assume, a string array is something I may need. But then again, my JSON schema is going to be more or less unique to me.
If I care about historic pages at all, the Wayback Machine is my good friend.
And if none of this makes sense, there are already a large number of services that offer website scraping. I found this through a quick search: Parsers - Free web scraper (https://parsers.me/).
|
You make several good points.
lw@zi wrote: Now, creating a JSON formatted string from, lets assume, a string array is something I may need. But then again, my JSON schema is going to be more or less unique to me.
Yeah, I made sure you could do that and back again with my JsonArray class. =) It even has overloads for KeyValuePair generation, since that's so common.
|
I have not. I'll check it out.
|
It's too easy.
|
Did you try Volapük and Hamlet?
I never got past the beginner crossword, and I take pride in that.
|
No, I just clicked on Tutorial, then Experienced.
|
Well, duh, of course the tutorial is easy.
I'm pretty sure Hamlet will give you a proper challenge (but don't blame me if that takes up the rest of your day).
|
I was looking at this problem, which is essentially a game known as Colonel Blotto.
Each general must allocate a given number of troops across a given number of platoons, which then battle against each other, with results decided by numerical supremacy. To win, you need both to win the greatest number of battles and to have the highest number of remaining troops.
So what we have is a zero-sum game with twin objectives (somewhat like hi-lo poker variants, but with a requirement to win both hands) and the additional complication of sequencing (a la rock-paper-scissors): if I set my troops out as [19, 1, 19, 1, 19, 1, 19, 1, 19, 1], I might get a very different result than I would sending out [1, 19, 1, 19, 1, 19, 1, 19, 1, 19], even though I'm using an identical strategy.
All of which gets me wondering: how well could we get a machine to play this game against a human opponent?
For the first element, calculating the optimal troop split, we could simply brute-force every possible combination, play them against each other, and see what wins. Given the vast number of combinations involved, it's obviously a huge task, but it's clearly doable. We're essentially going to wind up with a ranked list of possible distributions. Let's assume that the results of these calculations will be revealed/knowable to the human opponent.
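At toy scale the brute force is easy to sketch. The real case of splitting 100 troops across 10 platoons has over four trillion ordered splits, so this illustration (my own, with invented names) shrinks it to 12 troops and 3 platoons:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlottoBruteForce
{
    // Every ordered way to split `troops` across `platoons`.
    public static IEnumerable<int[]> Splits(int troops, int platoons)
    {
        if (platoons == 1) { yield return new[] { troops }; yield break; }
        for (int first = 0; first <= troops; first++)
            foreach (var rest in Splits(troops - first, platoons - 1))
                yield return new[] { first }.Concat(rest).ToArray();
    }

    // Battles won when distribution `a` meets `b` positionally.
    public static int Wins(int[] a, int[] b) =>
        a.Zip(b, (x, y) => x > y ? 1 : 0).Sum();

    public static void Main()
    {
        // Toy scale: 12 troops, 3 platoons -> 91 ordered splits.
        var all = Splits(12, 3).ToList();
        Console.WriteLine(all.Count); // 91

        // Rank each split by how many others it strictly beats on battles won.
        var ranked = all
            .OrderByDescending(a => all.Count(b => Wins(a, b) > Wins(b, a)))
            .ToList();
        Console.WriteLine(string.Join(",", ranked[0]));
    }
}
```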
Now the human general has an advantage: he knows that the computer knows that the best distribution is distribution X, and can easily counter. Suppose the calculated "optimal" distribution is actually 10 platoons of 10, and the human knows that the computer will play the "best" distribution; the human simply plays 9 platoons of 11 and 1 of 0, and crushes him every time.
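Under one plausible reading of the rules (the winner of each platoon battle keeps all its troops and the loser's are wiped out; the names and scoring here are my own), the 9-platoons-of-11 counter really does crush the uniform split:

```csharp
using System;
using System.Linq;

public static class BlottoScore
{
    // One plausible reading: platoons meet positionally, the larger side wins
    // the battle and keeps its troops, the smaller side is annihilated.
    public static (int wins, int remaining) Score(int[] mine, int[] theirs)
    {
        int wins = 0, remaining = 0;
        for (int i = 0; i < mine.Length; i++)
            if (mine[i] > theirs[i]) { wins++; remaining += mine[i]; }
        return (wins, remaining);
    }

    public static void Main()
    {
        var computer = Enumerable.Repeat(10, 10).ToArray();       // the "optimal" 10x10
        var human = Enumerable.Repeat(11, 9).Append(0).ToArray(); // 9x11 plus one empty platoon

        var h = Score(human, computer);
        var c = Score(computer, human);
        Console.WriteLine($"human: {h.wins} battles won, {h.remaining} troops left");    // 9, 99
        Console.WriteLine($"computer: {c.wins} battles won, {c.remaining} troops left"); // 1, 10
    }
}
```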
So the computer needs to throw a few curve balls. The easiest way to do this is to pick a random distribution from the top n distributions. The computer could further obfuscate by applying random increments and decrements to various pairs within the distribution, or whatever.
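That curve-ball step is a pick plus a jitter. Again, this is my own sketch; `ranked` is assumed to be the brute-forced list, best first:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Curveball
{
    // Pick one of the top n ranked distributions at random, then move one
    // troop between a random pair of platoons to obfuscate further.
    public static int[] Pick(IReadOnlyList<int[]> ranked, int n, Random rng)
    {
        var d = (int[])ranked[rng.Next(Math.Min(n, ranked.Count))].Clone();
        int from = rng.Next(d.Length), to = rng.Next(d.Length);
        if (from != to && d[from] > 0) { d[from]--; d[to]++; }
        return d; // total troop count is unchanged
    }

    public static void Main()
    {
        var ranked = new List<int[]> { new[] { 10, 10, 10 }, new[] { 12, 9, 9 } };
        var play = Pick(ranked, 2, new Random());
        Console.WriteLine(string.Join(",", play) + "  (sum " + play.Sum() + ")"); // sum stays 30
    }
}
```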
Similarly, when it comes to ordering the platoons, the computer's most effective approach might just be to randomise the whole distribution array; maybe it can work a more intricate algorithm based on previous distributions employed by the human general.
The human, of course, has all the same curve balls at his disposal.
Machine learning is definitely on the menu, and time is no object. All previous games are stored and available to both players for analysis. The human has all the necessary database skills and tools to view the data as meaningfully as possible.
The human general is so fond of Colonel Blotto and proud to be human that he is happy to play against the machine a million times over if necessary in order to prove that humans are smarter than tin.
Will he win?
|
PeejayAdams wrote: The human general is so fond of Colonel Blotto and proud to be human that he is happy to play against the machine a million times over if necessary in order to prove that humans are smarter than tin.
Will he win?
If the human is like me, he has one huge advantage:
a bloody big hammer!
(Well, two advantages if you include my strength ... of rage.)
|
Fair point, I'll rephrase the question:
Will there come a point where the human general has to resort to extreme violence, or to praying that the cleaner unplugs the computer?
|
PeejayAdams wrote: human general has to resort to extreme violence
Every human on earth has the propensity for evil and violence.
So yes, the human general has to resort to extreme violence; it is his nature.
|
Slacker007 wrote: Every human on earth has the propensity for evil
The true question here is what kind of evil: chaotic evil, neutral evil, or lawful evil.
I have lived with several Zen masters - all of them were cats.
His last invention was an evil Lasagna. It didn't kill anyone, and it actually tasted pretty good.
|
Well, evil is relative with regard to "type", but the only constant is that evil is the result of negative energy. So does it really matter what kind it is? No, of course not.
|
If there were no random elements to the machine's strategy, then I would be tempted to say the human could win over time, since if they were able to identify the machine's algorithm, they could potentially predict future moves.
However, given the random element, the human ultimately has no idea what the computer will play. Therefore, in order to have a chance of winning, the human must also play random guesses. So I believe the outcome will ultimately be based on luck alone, and could unpredictably go either way.
Having said that, the psychological element of the human could affect the true 'randomness' of their placements. For example, they may subconsciously implement a pattern to their randomness without realising it, and if the machine has been implemented to pick up on that, it could steal the upper hand. So in this scenario (where the human also plays random), I would probably choose the machine.
But... the wiki link suggests that there are strategies to avoid losing. In which case the human can simply play that strategy every time, and because they know the machine won't play it, the human will always either draw or win. If that strategy is permitted by the rules, though (i.e. playing the same distribution each game), then the machine would be stupid not to also play that same strategy every game. So all games would be a draw, and therefore no winner.
So, bottom line, I think it really depends on how prepared the human is before starting, and very much on how the machine is programmed to play.
|